[ https://issues.apache.org/jira/browse/NUTCH-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Julien Nioche updated NUTCH-794:
--------------------------------

    Description:
The following HTML document:
<html lang="fi"><head>document 1 title</head><body>jotain suomeksi</body></html>
is rendered as the following XHTML by Tika:
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml"><head><title/></head><body>document 1 titlejotain suomeksi</body></html>
with the lang attribute getting lost. The lang is not stored in the metadata either.
I will open an issue on Tika and modify TestHTMLLanguageParser so that the tests don't break anymore.

  was:
The following HTML document:
<html lang="fi"><head>document 1 title</head><body>jotain suomeksi</body></html>
is rendered as the following XHTML by Tika:
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml"><head><title/></head><body>document 1 titlejotain suomeksi</body></html>
with the lang attribute getting lost.
I will open an issue on Tika and modify TestHTMLLanguageParser so that the tests don't break anymore.

        Summary: Tika parser does not identify lang attributes on html tag  (was: Tika parser does not keep attributes on html tag)

> Tika parser does not identify lang attributes on html tag
> ---------------------------------------------------------
>
>                 Key: NUTCH-794
>                 URL: https://issues.apache.org/jira/browse/NUTCH-794
>             Project: Nutch
>          Issue Type: Bug
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>             Fix For: 1.1
>
>
> The following HTML document:
> <html lang="fi"><head>document 1 title</head><body>jotain suomeksi</body></html>
> is rendered as the following XHTML by Tika:
> <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml"><head><title/></head><body>document 1 titlejotain suomeksi</body></html>
> with the lang attribute getting lost. The lang is not stored in the metadata either.
> I will open an issue on Tika and modify TestHTMLLanguageParser so that the tests don't break anymore.
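
The attribute loss described in this issue can be checked mechanically. The following is a minimal sketch, not the actual TestHTMLLanguageParser code: it uses only the JDK's XML parser (no Tika or Nutch classes) to read the lang attribute from the root element of the two documents quoted above. The class name LangAttributeCheck and the method rootLang are hypothetical, introduced only for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch or Tika.
public class LangAttributeCheck {

    // The HTML document from the issue description (it happens to be well-formed XML).
    static final String INPUT =
        "<html lang=\"fi\"><head>document 1 title</head>"
        + "<body>jotain suomeksi</body></html>";

    // The XHTML that the issue reports Tika producing for that document.
    static final String TIKA_OUTPUT =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
        + "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head><title/></head>"
        + "<body>document 1 titlejotain suomeksi</body></html>";

    // Returns the value of the root element's lang attribute,
    // or an empty string when the attribute is absent.
    static String rootLang(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getAttribute("lang");
    }

    public static void main(String[] args) throws Exception {
        System.out.println("input lang:  '" + rootLang(INPUT) + "'");
        System.out.println("output lang: '" + rootLang(TIKA_OUTPUT) + "'");
    }
}
```

Run against the two strings from the description, this reports "fi" for the input and an empty value for the rendered output, which is the symptom the issue describes. A real test would feed the input through the Tika-based parser rather than hard-coding its output.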