[ 
https://issues.apache.org/jira/browse/NUTCH-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834657#action_12834657
 ] 

Hudson commented on NUTCH-794:
------------------------------

Integrated in Nutch-trunk #1071 (See 
[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1071/])
     : Language Identification must check the parse metadata for language 
values


> Language Identification must check the parse metadata for language values
> -------------------------------------------------------------------------
>
>                 Key: NUTCH-794
>                 URL: https://issues.apache.org/jira/browse/NUTCH-794
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>             Fix For: 1.1
>
>         Attachments: NUTCH-794.patch
>
>
> The following HTML document:
> <html lang="fi"><head>document 1 title</head><body>jotain 
> suomeksi</body></html>
> is rendered by Tika as the following XHTML:
> <?xml version="1.0" encoding="UTF-8"?><html 
> xmlns="http://www.w3.org/1999/xhtml"><head><title/></head><body>document 1 
> titlejotain suomeksi</body></html>
> The lang attribute is lost in the process, and the language is not stored in
> the metadata either.
> I will open an issue on Tika and modify TestHTMLLanguageParser so that the 
> tests no longer break.
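
The loss described above can be checked with a small standalone sketch. It
does not use Tika or Nutch at all: it assumes only the JDK's built-in XML
parser, and it relies on the fact that the sample input happens to be
well-formed XML. The class name LangAttributeCheck and the helper rootLang are
hypothetical names chosen for illustration. The two strings are the input
document and the Tika output quoted in the issue, with the mail line wraps
joined.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of Nutch or Tika.
public class LangAttributeCheck {

    // Input document from the issue description.
    static final String INPUT =
        "<html lang=\"fi\"><head>document 1 title</head>"
        + "<body>jotain suomeksi</body></html>";

    // XHTML that the issue reports as Tika's output for that document.
    static final String TIKA_OUTPUT =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
        + "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head><title/></head>"
        + "<body>document 1 titlejotain suomeksi</body></html>";

    // Returns the lang attribute of the root element, or "" if it is absent.
    static String rootLang(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getAttribute("lang");
    }

    public static void main(String[] args) throws Exception {
        System.out.println("input lang:  '" + rootLang(INPUT) + "'");
        System.out.println("output lang: '" + rootLang(TIKA_OUTPUT) + "'");
    }
}
```

With these two strings, the first call yields "fi" and the second yields an
empty string, which matches the behaviour reported in the issue: nothing in the
serialized output lets a later step recover the language.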

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
