[jira] Updated: (NUTCH-706) Url regex normalizer
[ https://issues.apache.org/jira/browse/NUTCH-706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-706: Fix Version/s: (was: 1.1) Both variants of the substitution rule above break existing tests. More work will be needed to get a pattern which covers the case described by Meghna *and* is compatible with the existing test cases. Moving it to post-1.1. Url regex normalizer Key: NUTCH-706 URL: https://issues.apache.org/jira/browse/NUTCH-706 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Reporter: Meghna Kukreja Priority: Minor Hey, I encountered the following problem while trying to crawl a site using nutch-trunk. In the file regex-normalize.xml, the following regex is used to remove session ids: <pattern>([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\?|&amp;|#|$)</pattern>. This pattern also transforms a URL such as newsId=2000484784794&newsLang=en into new&newsLang=en (since it matches 'sId' in 'newsId'), which is incorrect, and hence the page does not get fetched. This expression needs to be changed to prevent this. Thanks, Meghna -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
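The behaviour Meghna describes can be reproduced outside Nutch with plain java.util.regex. The sketch below is only an illustration: it assumes the rule's substitution is $4 (the trailing delimiter), and the GUARDED pattern is a hypothetical variant with a lookbehind, not one of the variants Julien tried and not checked against the existing Nutch test cases. The example.com URLs are made up.

```java
import java.util.regex.Pattern;

public class SessionIdRegexDemo {
    // Pattern as quoted in the report, with the XML entity &amp; decoded to '&'.
    static final Pattern REPORTED = Pattern.compile(
        "([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\\?|&|#|$)");

    // Hypothetical variant: the session key must not directly follow a letter or digit,
    // so the "sId" inside "newsId" no longer qualifies.
    static final Pattern GUARDED = Pattern.compile(
        "([;_]?(?<![A-Za-z0-9])((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\\?|&|#|$)");

    // Replace every match with its trailing delimiter (group 4); "$4" is an assumption.
    static String strip(Pattern p, String url) {
        return p.matcher(url).replaceAll("$4");
    }

    public static void main(String[] args) {
        String news = "http://example.com/story?newsId=2000484784794&newsLang=en";
        // Reported pattern eats "sId=2000484784794": .../story?new&newsLang=en
        System.out.println(strip(REPORTED, news));
        // Guarded variant leaves the URL unchanged.
        System.out.println(strip(GUARDED, news));
        // A real session id is still removed: http://example.com/page?x=1
        System.out.println(strip(GUARDED, "http://example.com/page;jsessionid=ABC123?x=1"));
    }
}
```

Whether a guard like this is compatible with the normalizer's existing tests is exactly the open question in this issue, so treat it as a starting point for experiments only.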
[jira] Commented: (NUTCH-706) Url regex normalizer
[ https://issues.apache.org/jira/browse/NUTCH-706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851923#action_12851923 ] Ken Krugler commented on NUTCH-706: Two comments about this: 1. From my experiences with Nutch & Bixo, I think that URL normalization ultimately needs to be more structured, i.e. first break the URL into pieces, then apply rules against the pieces. Trying to craft regular expressions to handle target cases leads to big, hairy, hard-to-understand strings. 2. URL normalization is something that makes a lot of sense for crawler-commons. If somebody from the Nutch side wants to define a target API, I could look at porting existing Bixo code to crawler-commons.
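To make Ken's first point concrete, here is a minimal sketch of the "break the URL into pieces, then apply rules against the pieces" idea using only java.net.URI. The class name, the list of session parameter names, and the decision to compare whole parameter names are assumptions for illustration; this is not Bixo, crawler-commons, or Nutch code.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StructuredUrlNormalizer {
    // Illustrative session parameter names (an assumption, not an official rule set).
    private static final Set<String> SESSION_KEYS = new HashSet<>(
        Arrays.asList("sid", "lsid", "jsid", "phpsessid", "sessionid", "jsessionid"));

    public static String normalize(String url) {
        URI uri;
        try {
            uri = new URI(url);
        } catch (URISyntaxException e) {
            return url; // leave unparseable input untouched
        }
        String query = uri.getRawQuery();
        if (query == null) {
            return url; // nothing to filter
        }
        // Apply the rule to each query parameter, comparing the whole name.
        List<String> kept = new ArrayList<>();
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            String key = (eq >= 0 ? pair.substring(0, eq) : pair).toLowerCase(Locale.ROOT);
            if (!SESSION_KEYS.contains(key)) {
                kept.add(pair);
            }
        }
        // Reassemble the URL from its raw (still encoded) pieces.
        StringBuilder out = new StringBuilder();
        if (uri.getScheme() != null) out.append(uri.getScheme()).append(':');
        if (uri.getRawAuthority() != null) out.append("//").append(uri.getRawAuthority());
        if (uri.getRawPath() != null) out.append(uri.getRawPath());
        if (!kept.isEmpty()) out.append('?').append(String.join("&", kept));
        if (uri.getRawFragment() != null) out.append('#').append(uri.getRawFragment());
        return out.toString();
    }
}
```

Because the comparison is on the complete parameter name, newsId is never confused with sid. Session ids carried as path parameters (for example ;jsessionid=...) would need a separate rule on the path piece, which this sketch does not attempt.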
Re: 1.1 release?
Hey Guys, OK I'm finally getting around to this: I am going to push all the current 1.1 JIRA issues out and set their fix version to nil. Once I'm done with this, I'll wait 48 hrs to see if there is anything that anyone really wants to get into 1.1. So, please, take a look here [1] and make sure that if you wanted your issue into 1.1, that it's there. After 48 hours, I'll make one more announcement, and wait 24 hours before cutting the 1.1 RC and pushing to people.a.o for review. Here I go! Cheers, Chris [1] http://bit.ly/cNehBc On 3/9/10 10:54 AM, Andrzej Bialecki a...@getopt.org wrote: On 2010-03-09 18:17, Julien Nioche wrote: Hi Chris, Excellent idea! There have been quite a few changes since 1.0 and it's probably the right time to have a new release. +1. Let's just check JIRA and make sure we didn't forget anything important ... Not really a blocker but https://issues.apache.org/jira/browse/NUTCH-762 would be nice to have in 1.1, just needs a bit of reviewing / testing I suppose. Otherwise this can wait until after 1.1. I'll try to test it before the weekend. -- Best regards, Andrzej Bialecki Information Retrieval, Semantic Web Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.mattm...@jpl.nasa.gov WWW: http://sunset.usc.edu/~mattmann/ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA
[jira] Updated: (NUTCH-249) black- white list url filtering
[ https://issues.apache.org/jira/browse/NUTCH-249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-249: Fix Version/s: (was: 1.1) - push out per http://bit.ly/c7tBv9 black- white list url filtering --- Key: NUTCH-249 URL: https://issues.apache.org/jira/browse/NUTCH-249 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 0.8 Reporter: Stefan Groschupf Assignee: Dennis Kubes Priority: Trivial Attachments: blackWhiteListV2.patch, blackWhiteListV3.patch, bw.patch Existing url filter mechanisms need to process each url against each filter pattern. For very large filter sets this may not scale very well.
[jira] Updated: (NUTCH-309) Uses commons logging Code Guards
[ https://issues.apache.org/jira/browse/NUTCH-309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-309: Fix Version/s: (was: 1.1) - pushing this out per http://bit.ly/c7tBv9 Uses commons logging Code Guards Key: NUTCH-309 URL: https://issues.apache.org/jira/browse/NUTCH-309 Project: Nutch Issue Type: Improvement Affects Versions: 0.8 Reporter: Jerome Charron Assignee: Chris A. Mattmann Priority: Minor Code guards are typically used to guard code that only needs to execute in support of logging, that otherwise introduces undesirable runtime overhead in the general case (logging disabled). Examples are multiple parameters, or expressions (e.g. string + more) for parameters. Use the guard methods of the form log.isPriority() to verify that logging should be performed, before incurring the overhead of the logging method call. Yes, the logging methods will perform the same check, but only after resolving parameters. (description extracted from http://jakarta.apache.org/commons/logging/guide.html#Code_Guards) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
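The code-guard idea quoted in this issue can be sketched as follows. To keep the example self-contained it uses java.util.logging from the JDK rather than commons logging, so isLoggable(Level.FINE) stands in for the log.isDebugEnabled() style guard of commons logging; the class, method names, and counter are hypothetical and exist only to show when the message is built.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class CodeGuardDemo {
    private static final Logger LOG = Logger.getLogger(CodeGuardDemo.class.getName());

    // Counts how often the (potentially expensive) message is built.
    static int describeCalls = 0;

    static String describe(String url, int outlinks) {
        describeCalls++;
        return "fetched " + url + " with " + outlinks + " outlinks";
    }

    // Helper for the demo: switch debug-level logging on or off.
    static void setDebugEnabled(boolean enabled) {
        LOG.setLevel(enabled ? Level.FINE : Level.INFO);
    }

    // Without a guard the message is built even when FINE is disabled;
    // the logger only discards it afterwards.
    static void unguarded(String url, int outlinks) {
        LOG.fine(describe(url, outlinks));
    }

    // With a guard the message is built only when it will actually be logged.
    static void guarded(String url, int outlinks) {
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine(describe(url, outlinks));
        }
    }
}
```

With debug logging disabled, a call to unguarded still pays for the string concatenation while guarded does not, which is the overhead the issue proposes to remove.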
[jira] Updated: (NUTCH-763) Separate configuration files from resources to be included in the job file
[ https://issues.apache.org/jira/browse/NUTCH-763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-763: Fix Version/s: (was: 1.1) - pushing this out per http://bit.ly/c7tBv9 Separate configuration files from resources to be included in the job file -- Key: NUTCH-763 URL: https://issues.apache.org/jira/browse/NUTCH-763 Project: Nutch Issue Type: Wish Reporter: Julien Nioche Priority: Minor One of the things I found confusing when I was learning Nutch was the fact that the conf/ directory contains at the same time : - configuration files for Hadoop / Nutch which are put in the jar files but not used there - resource files (e.g. filtering rules) which MUST be up to date in the job file I would separate the conf/ directory from say a resources/ directory which would contain the rule files and other things to put in the job file. Unless I am mistaken none of the configuration files need to be in the job file. I know it is a very minor point, but that would probably simplify things and make it easier for beginners to understand what has to be modified where. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-577) Use explicit tika-config.xml file to enable mime magic detection to be turned on and off
[ https://issues.apache.org/jira/browse/NUTCH-577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-577: Due Date: 30/Nov/07 (was: 30/Nov/07) Fix Version/s: (was: 1.1) - pushing this out per http://bit.ly/c7tBv9 Use explicit tika-config.xml file to enable mime magic detection to be turned on and off Key: NUTCH-577 URL: https://issues.apache.org/jira/browse/NUTCH-577 Project: Nutch Issue Type: Improvement Components: mime_type_detector Affects Versions: 1.0.0 Environment: Mac Book Pro Intel Core Duo 2.0 Ghz, 2. 0 GB RAM, Mac OS X 10.4, although improvement is indep. of env. Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Priority: Minor Currently, there is a configuration file for Tika (which the trunk in Nutch uses for its mime type detection) called tika-config.xml left unexposed (a default one lives in the tika-0.1-dev.jar file). Tika's mime system has two config files it relies on: tika-mimetypes.xml (which Nutch has its own version of, that overrides the version that comes with the tika jar file), and tika-config.xml (to turn on or off magic char detection). We should probably have a nutch version of tika-config.xml, so that Nutch users can employ magic char mime detection. I'll get going on this in the next day or so. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-310) Review Log Levels
[ https://issues.apache.org/jira/browse/NUTCH-310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-310: Fix Version/s: (was: 1.1) Assignee: Chris A. Mattmann (was: Jerome Charron) - pushing this out per http://bit.ly/c7tBv9 (and assign to me, I think this can be closed but will wait until after 1.1 to revisit) Review Log Levels - Key: NUTCH-310 URL: https://issues.apache.org/jira/browse/NUTCH-310 Project: Nutch Issue Type: Improvement Affects Versions: 0.8 Reporter: Jerome Charron Assignee: Chris A. Mattmann Priority: Minor Review of log content and log levels (see Commons Logging Best Practices: http://jakarta.apache.org/commons/logging/guide.html#Message_Priorities_Levels)
[jira] Updated: (NUTCH-673) Upgrade the Carrot2 plug-in to release 3.0
[ https://issues.apache.org/jira/browse/NUTCH-673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-673: Fix Version/s: (was: 1.1) - pushing this out per http://bit.ly/c7tBv9 Upgrade the Carrot2 plug-in to release 3.0 -- Key: NUTCH-673 URL: https://issues.apache.org/jira/browse/NUTCH-673 Project: Nutch Issue Type: Improvement Components: web gui Affects Versions: 0.9.0 Environment: All Nutch deployments. Reporter: Sean Dean Priority: Minor Release 3.0 of the Carrot2 plug-in was released recently. We currently have version 2.1 in the source tree, and upgrading it to the latest version before 1.0-release might make sense. Details on the release can be found here: http://project.carrot2.org/release-3.0-notes.html One major change in requirements is that JDK 1.5 must be used, but this is also now required for Hadoop 0.19, so this wouldn't be the only reason for the switch.
[jira] Updated: (NUTCH-664) Possibility to update already stored documents.
[ https://issues.apache.org/jira/browse/NUTCH-664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-664: Fix Version/s: (was: 1.1) - pushing this out per http://bit.ly/c7tBv9 Possibility to update already stored documents. --- Key: NUTCH-664 URL: https://issues.apache.org/jira/browse/NUTCH-664 Project: Nutch Issue Type: Wish Reporter: Sergey Khilkov Priority: Minor We have a huge index of stored documents. Fetching the page and merging indexes every time we update some information about a page is a high-cost procedure. The information can change 1-3 times per day. At the moment we have to store the changed info in a database, but in that case we have lots of problems with sorting, search restrictions and so on. Lucene itself allows deleting a single document and adding a new one to an existing index. But there is a problem with Hadoop: as I understand it, the Hadoop filesystem cannot write at random positions. It would be a great feature if Nutch were able to update an already created index.
[jira] Updated: (NUTCH-750) HtmlParser plugin - page title extraction
[ https://issues.apache.org/jira/browse/NUTCH-750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-750: Fix Version/s: (was: 1.1) - pushing this out per http://bit.ly/c7tBv9 HtmlParser plugin - page title extraction - Key: NUTCH-750 URL: https://issues.apache.org/jira/browse/NUTCH-750 Project: Nutch Issue Type: Improvement Components: parser Affects Versions: 1.0.0 Reporter: Alexey Torochkov Priority: Minor Attachments: SkipBody.patch A small improvement: try to extract the title tag from the body if it doesn't exist in the head. In the current version DOMContentUtils just skips everything after body in the getTitle() method. The attached patch allows changing this behavior (by default it doesn't change anything) and can cope with webmasters' mistakes.
[jira] Updated: (NUTCH-564) External parser supports encoding attribute
[ https://issues.apache.org/jira/browse/NUTCH-564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-564: Patch Info: [Patch Available] Fix Version/s: (was: 1.1) - pushing this out per http://bit.ly/c7tBv9 External parser supports encoding attribute --- Key: NUTCH-564 URL: https://issues.apache.org/jira/browse/NUTCH-564 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 0.9.0 Environment: All Reporter: Antony Bowesman Priority: Minor Attachments: ExtParser_0.9.0.patch, ExtParser_1.0.0.patch When an external component generates text, which is returned to the external parser, it always converts the text using the default character set. (os.toString()). For example, the returned text may be utf-8, but will not be converted to a String correctly. I added the attribute encoding to the implementation XML in plugin.xml and this is then used to convert the text. I have tested my original fix on my local 0.9 and include a patch, but have also made an untested patch for trunk. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-477) Extend URLFilters to support different filtering chains
[ https://issues.apache.org/jira/browse/NUTCH-477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-477: Fix Version/s: (was: 1.1) - pushing this out per http://bit.ly/c7tBv9 Extend URLFilters to support different filtering chains --- Key: NUTCH-477 URL: https://issues.apache.org/jira/browse/NUTCH-477 Project: Nutch Issue Type: Improvement Affects Versions: 1.1 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Priority: Minor Attachments: urlfilters.patch I propose to make the following changes to URLFilters: * extend URLFilters so that they support different filtering rules depending on the context where they are executed. This functionality mirrors the one that URLNormalizers already support. * change their return value to an int code, in order to support early termination of long filtering chains. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
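The second proposal, an int return code that lets a chain stop early, might look roughly like the sketch below. Everything here is hypothetical: the code values, the CodedFilter interface, and the demo chain are invented for illustration and are not taken from urlfilters.patch or from the existing URLFilter API.

```java
import java.util.ArrayList;
import java.util.List;

public class FilterChainSketch {
    // Hypothetical return codes.
    static final int CONTINUE = 0;      // no decision, ask the next filter
    static final int REJECT = 1;        // drop the URL and stop the chain
    static final int ACCEPT_FINAL = 2;  // accept the URL and stop the chain

    interface CodedFilter {
        int filter(String url);
    }

    // Counts calls to the "expensive" filter so early termination is observable.
    static int expensiveCalls = 0;

    // Runs the chain, stopping as soon as a filter returns a final decision.
    static boolean accepts(List<CodedFilter> chain, String url) {
        for (CodedFilter f : chain) {
            int code = f.filter(url);
            if (code == REJECT) return false;
            if (code == ACCEPT_FINAL) return true;
        }
        return true; // nobody objected
    }

    // Demo chain: a cheap whitelist check first, then a costlier rule.
    static List<CodedFilter> demoChain() {
        List<CodedFilter> chain = new ArrayList<>();
        chain.add(url -> url.startsWith("http://trusted.example/") ? ACCEPT_FINAL : CONTINUE);
        chain.add(url -> {
            expensiveCalls++;
            return url.contains("?") ? REJECT : CONTINUE;
        });
        return chain;
    }
}
```

With a String-or-null return value a filter can only reject early; a code such as ACCEPT_FINAL additionally lets a cheap filter short-circuit the rest of a long chain.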
[jira] Updated: (NUTCH-251) Administration GUI
[ https://issues.apache.org/jira/browse/NUTCH-251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-251: Patch Info: [Patch Available] Fix Version/s: (was: 1.1) - pushing this out per http://bit.ly/c7tBv9 (comment from me: would be nice to get this into 1.2) Administration GUI -- Key: NUTCH-251 URL: https://issues.apache.org/jira/browse/NUTCH-251 Project: Nutch Issue Type: Improvement Affects Versions: 0.8 Reporter: Stefan Groschupf Priority: Minor Attachments: hadoop_nutch_gui_v1.patch, Nutch-251-AdminGUI.tar.gz, nutch_gui_plugins_v1.zip, nutch_gui_v1.patch Having a web based administration interface would help to make nutch administration and management much more user friendly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-609) Allow Plugins to be Loaded from Jar File(s)
[ https://issues.apache.org/jira/browse/NUTCH-609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-609: Due Date: 13/Feb/08 (was: 13/Feb/08) Patch Info: [Patch Available] Fix Version/s: (was: 1.1) - pushing this out per http://bit.ly/c7tBv9 Allow Plugins to be Loaded from Jar File(s) --- Key: NUTCH-609 URL: https://issues.apache.org/jira/browse/NUTCH-609 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Priority: Minor Attachments: NUTCH-609-1-20080212.patch Currently plugins cannot be loaded from a jar file. Plugins must be unzipped in one or more directories specified by the plugin.folders config. I have been thinking about an extension to PluginRepository or PluginManifestParser (or both) that would allow plugins to be packaged into multiple independent jar files and placed on the classpath. The system would search the classpath for resources with the correct folder name and would load any plugins in those jars. This functionality would be very useful in making the nutch core more flexible in terms of packaging. It would also help with web applications where we don't want to have a plugins directory included in the webapp. Thoughts so far are unzipping those plugin jars into a common temp directory before loading. Another option is using something like commons vfs to interact with the jar files. VFS essentially uses a disk-based temporary cache for jar files, so it is pretty much the same solution. What are everyone else's thoughts on this?
[jira] Resolved: (NUTCH-794) Language Identification must use check the parse metadata for language values
[ https://issues.apache.org/jira/browse/NUTCH-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann resolved NUTCH-794. - Resolution: Fixed @julien -- I think this issue has been fixed in Tika right? If not, feel free to reopen, or better yet, re-file the issue against a post 1.1 Nutch release. Thanks! Language Identification must use check the parse metadata for language values -- Key: NUTCH-794 URL: https://issues.apache.org/jira/browse/NUTCH-794 Project: Nutch Issue Type: Bug Components: parser Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.1 Attachments: NUTCH-794.patch The following HTML document: <html lang="fi"><head>document 1 title</head><body>jotain suomeksi</body></html> is rendered as the following xhtml by Tika: <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml"><head><title/></head><body>document 1 titlejotain suomeksi</body></html> with the lang attribute getting lost. The lang is not stored in the metadata either. I will open an issue on Tika and modify TestHTMLLanguageParser so that the tests don't break anymore.
[jira] Updated: (NUTCH-578) URL fetched with 403 is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-578: Fix Version/s: (was: 1.1) - pushing this out per http://bit.ly/c7tBv9 URL fetched with 403 is generated over and over again - Key: NUTCH-578 URL: https://issues.apache.org/jira/browse/NUTCH-578 Project: Nutch Issue Type: Bug Components: generator Affects Versions: 1.0.0 Environment: Ubuntu Gutsy Gibbon (7.10) running on VMware server. I have checked out the most recent version of the trunk as of Nov 20, 2007 Reporter: Nathaniel Powell Assignee: Dennis Kubes Attachments: crawl-urlfilter.txt, NUTCH-578.patch, NUTCH-578_v2.patch, NUTCH-578_v3.patch, NUTCH-578_v4.patch, nutch-site.xml, regex-normalize.xml, urls.txt I have not changed the following parameter in the nutch-default.xml: <property> <name>db.fetch.retry.max</name> <value>3</value> <description>The maximum number of times a url that has encountered recoverable errors is generated for fetch.</description> </property> However, there is a URL which is on the site that I'm crawling, www.teachertube.com, which keeps being generated over and over again for almost every segment (many more times than 3): fetch of http://www.teachertube.com/images/ failed with: Http code=403, url=http://www.teachertube.com/images/ This is a bug, right? Thanks.
[jira] Updated: (NUTCH-540) some problem about the Nutch cache
[ https://issues.apache.org/jira/browse/NUTCH-540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-540: Fix Version/s: (was: 1.1) - pushing this out per http://bit.ly/c7tBv9 some problem about the Nutch cache -- Key: NUTCH-540 URL: https://issues.apache.org/jira/browse/NUTCH-540 Project: Nutch Issue Type: Bug Components: searcher Affects Versions: 0.9.0 Environment: Red hat AS4 + Tomcat5.5 + Nutch0.9 Reporter: crossany Attachments: 1.gif, 1186733525.jpg I'm Chinese, and I was testing searches for Chinese words in Nutch. I installed Nutch 0.9 in Tomcat 5 on Linux; the Tomcat charset is UTF-8. I used Nutch to crawl a Chinese website whose charset is also UTF-8. When I use Nutch on Tomcat to search for a Chinese word, the title and description of the search results are displayed correctly. But when I click the cached link, the cached page is displayed with the wrong character encoding, even though the cached page's charset is also UTF-8. I found a website that uses Nutch, http://www.synoo.com:8080/zh/, and tested a search for a Chinese word there; it shows the same error. I used Luke to look at the segments and it can display the Chinese words, so I think this may be a bug.
[jira] Updated: (NUTCH-455) dedup on tokenized fields is faulty
[ https://issues.apache.org/jira/browse/NUTCH-455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-455: Fix Version/s: (was: 1.1) - pushing this out per http://bit.ly/c7tBv9 dedup on tokenized fields is faulty --- Key: NUTCH-455 URL: https://issues.apache.org/jira/browse/NUTCH-455 Project: Nutch Issue Type: Bug Components: searcher Affects Versions: 0.9.0 Reporter: Enis Soztutar Attachments: IndexSearcherCacheWarm.patch (From LUCENE-252) nutch uses several index servers, and the search results from these servers are merged using a dedup field for deleting duplicates. The values from this field are cached by Lucene's FieldCacheImpl. The default is the site field, which is indexed and tokenized. However for a Tokenized Field (for example url in nutch), FieldCacheImpl returns an array of Terms rather than an array of field values, so dedup'ing becomes faulty. The current FieldCache implementation does not respect tokenized fields, and as described above caches only terms. So in the situation that we are searching using url as the dedup field, when a Hit is constructed in IndexSearcher, the dedupValue becomes a token of the url (such as www or com) rather than the whole url. This prevents using tokenized fields in the dedup field. I have written a patch for lucene and attached it in http://issues.apache.org/jira/browse/LUCENE-252; this patch fixes the aforementioned issue about tokenized field caching. However building such a cache for about 1.5M documents takes 20+ secs. The code in IndexSearcher.translateHits() starts with if (dedupField != null) dedupValues = FieldCache.DEFAULT.getStrings(reader, dedupField); and for the first call of search in IndexSearcher, the cache is built. Long story short, I have written a patch against IndexSearcher, which in its constructor warms up the caches of the wanted fields (configurable). I think we should vote for LUCENE-252, and then commit the above patch with the latest version of lucene.
[jira] Updated: (NUTCH-747) injectIndex metadatas and inherit these metadatas to all matching suburls
[ https://issues.apache.org/jira/browse/NUTCH-747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-747: Fix Version/s: (was: 1.1) - pushing this out per http://bit.ly/c7tBv9 injectIndex metadatas and inherit these metadatas to all matching suburls -- Key: NUTCH-747 URL: https://issues.apache.org/jira/browse/NUTCH-747 Project: Nutch Issue Type: Improvement Components: indexer, injector Reporter: Marko Bauhardt Attachments: index-metadata.patch, metadata.patch

Hi. The following two patches are attached.

The first patch
+ injects metadata for URLs into a metadatadb, one entry per line: url.com TAB METAKEY : TAB METAVALUE TAB METAVALUE METAKEY : METAVALUE ... ...
+ updates the parse_data metadata of a shard and writes the metadata to all fetched URLs that start with a URL from the metadatadb
+ so this patch supports inheritance of metadata to all matching suburls

The second patch implements an index-metadata plugin.
+ this plugin extracts all metadata from the parse_data of a shard and indexes it. Which metadata is indexed can be configured in plugin.properties.
+ to index, for example, the lang, configure plugin.properties with: lang=STORE,UNTOKENIZED
+ this means that the index plugin extracts metadata values with the key lang. If they exist, all values are indexed stored and untokenized.

Example

Create start URLs in /tmp/urls/start/urls.txt
  http://lucene.apache.org/nutch/apidocs-1.0/index.html
  http://lucene.apache.org/nutch/apidocs-0.9/index.html

Create metadata URLs in /tmp/urls/metadata/urls.txt
  http://lucene.apache.org/nutch/apidocs-1.0/ version:1.0
  http://lucene.apache.org/nutch/apidocs-0.9/ version:0.9

Inject URLs
  bin/nutch inject crawldb /tmp/urls/start/
  bin/nutch org.apache.nutch.crawl.metadata.MetadataInjector metadatadb /tmp/urls/metadata/

Fetch, Parse, Update
  bin/nutch generate crawldb segments
  bin/nutch fetch segments/20090806105717/
  bin/nutch org.apache.nutch.crawl.metadata.ParseDataUpdater metadatadb segments/20090806105717
  bin/nutch updatedb crawldb/ segments/20090806105717/

Fetch, Parse, Update again ...

Index
  bin/nutch invertlinks linkdb -dir segments/
  bin/nutch index index crawldb/ linkdb/ segments/20090806105717 segments/20090806110127

Check your index
  All URLs starting with http://lucene.apache.org/nutch/apidocs-1.0/ are indexed with version:1.0.
  All URLs starting with http://lucene.apache.org/nutch/apidocs-0.9/ are indexed with version:0.9.

This issue is somewhat related to http://issues.apache.org/jira/browse/NUTCH-655
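The inheritance rule described in this issue (every fetched URL that starts with a URL from the metadatadb receives that entry's metadata) boils down to a prefix match. The sketch below shows only that rule with in-memory maps; the class and method names are hypothetical and it does not reproduce the MetadataInjector or ParseDataUpdater classes from the attached patches.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PrefixMetadataSketch {
    // URL prefix -> metadata injected for that prefix (stand-in for the metadatadb).
    private final Map<String, Map<String, String>> byPrefix = new LinkedHashMap<>();

    // Record one key/value pair for a URL prefix.
    public void inject(String prefix, String key, String value) {
        byPrefix.computeIfAbsent(prefix, p -> new LinkedHashMap<>()).put(key, value);
    }

    // Collect the metadata of every injected prefix that the fetched URL starts with.
    public Map<String, String> metadataFor(String url) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, String>> e : byPrefix.entrySet()) {
            if (url.startsWith(e.getKey())) {
                result.putAll(e.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        PrefixMetadataSketch db = new PrefixMetadataSketch();
        db.inject("http://lucene.apache.org/nutch/apidocs-1.0/", "version", "1.0");
        db.inject("http://lucene.apache.org/nutch/apidocs-0.9/", "version", "0.9");
        // Prints {version=1.0}
        System.out.println(db.metadataFor("http://lucene.apache.org/nutch/apidocs-1.0/index.html"));
    }
}
```

A linear scan over all prefixes is fine for a handful of entries; for a large metadatadb a sorted structure or a trie keyed on the URL would be the more natural choice.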
[jira] Updated: (NUTCH-479) Support for OR queries
[ https://issues.apache.org/jira/browse/NUTCH-479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-479: Patch Info: [Patch Available] Fix Version/s: (was: 1.1) - pushing this out per http://bit.ly/c7tBv9 Support for OR queries -- Key: NUTCH-479 URL: https://issues.apache.org/jira/browse/NUTCH-479 Project: Nutch Issue Type: Improvement Components: searcher Affects Versions: 1.0.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Attachments: nutch_0.9_OR.patch, or.patch, or.patch There have been many requests from users to extend Nutch query syntax to add support for OR queries, in addition to the implicit AND and NOT queries supported now. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-677) Segment merge filering based on segment content
[ https://issues.apache.org/jira/browse/NUTCH-677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-677: Fix Version/s: (was: 1.1) - pushing this out per http://bit.ly/c7tBv9 Segment merge filering based on segment content --- Key: NUTCH-677 URL: https://issues.apache.org/jira/browse/NUTCH-677 Project: Nutch Issue Type: Improvement Affects Versions: 0.9.0 Reporter: Marcin Okraszewski Attachments: MergeFilter.patch, MergeFilter_for_1.0.patch, SegmentMergeFilter.java, SegmentMergeFilter.java, SegmentMergeFilters.java, SegmentMergeFilters.java I needed a segment filtering based on meta data detected during parse phase. Unfortunately current URL based filtering does not allow for this. So I have created a new SegmentMergeFilter extension which receives segment entry which is being merged and decides if it should be included or not. Even though I needed only ParseData for my purpose I have done it a bit more general purpose, so the filter receives all merged data. The attached patch is for version 0.9 which I use. Unfortunately I didn't have time to check how it fits to trunk version. Sorry :( -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-774) Retry interval in crawl date is set to 0
[ https://issues.apache.org/jira/browse/NUTCH-774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-774: Fix Version/s: (was: 1.1) - pushing this out per http://bit.ly/c7tBv9 Retry interval in crawl date is set to 0 Key: NUTCH-774 URL: https://issues.apache.org/jira/browse/NUTCH-774 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.0.0 Reporter: Reinhard Schwab Assignee: Andrzej Bialecki Attachments: NUTCH-774.patch, NUTCH-774_2.patch When i fetch and parse a feed with the feed plugin, http://www.wachauclimbing.net/home/impressum-disclaimer/feed/ another crawl date is generated http://www.wachauclimbing.net/home/impressum-disclaimer/comment-page-1/ after fetching a second round the dump in the crawl db still shows a retry interval with value 0. http://www.wachauclimbing.net/home/impressum-disclaimer/comment-page-1/ Version: 7 Status: 2 (db_fetched) Fetch time: Wed Dec 02 12:48:22 CET 2009 Modified time: Thu Jan 01 01:00:00 CET 1970 Retries since fetch: 0 Retry interval: 0 seconds (0 days) Score: 1.084 Signature: db9ab2193924cd2d0b53113a500ca604 Metadata: _pst_: success(1), lastModified=0 a check should be done in DefaultFetchSchedule (or AbstractFetchSchedule) in the method setFetchSchedule -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
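The check Reinhard suggests could take a shape like the one below: fall back to a default interval whenever the stored retry interval is not positive, so the next fetch time actually moves forward. This is a standalone sketch, not the code of DefaultFetchSchedule or AbstractFetchSchedule and not the attached patches; the method names are hypothetical and the 30-day default is an assumed value standing in for whatever the configured default interval is.

```java
public class FetchIntervalGuardSketch {
    // Assumed default of 30 days, in seconds (stand-in for the configured default).
    static final int DEFAULT_INTERVAL_SECONDS = 30 * 24 * 60 * 60;

    // Use the stored interval only if it is positive, otherwise the default.
    static int effectiveInterval(int storedIntervalSeconds, int defaultIntervalSeconds) {
        return storedIntervalSeconds > 0 ? storedIntervalSeconds : defaultIntervalSeconds;
    }

    // Next fetch time in milliseconds, guaranteed to lie after the current fetch time
    // as long as the default interval is positive.
    static long nextFetchTime(long fetchTimeMillis, int storedIntervalSeconds,
                              int defaultIntervalSeconds) {
        return fetchTimeMillis
            + effectiveInterval(storedIntervalSeconds, defaultIntervalSeconds) * 1000L;
    }
}
```

With the values from the crawl db dump above, a stored interval of 0 would be replaced by the default instead of scheduling the URL for immediate refetch.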
[jira] Updated: (NUTCH-460) RDF parser plugin
[ https://issues.apache.org/jira/browse/NUTCH-460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-460: Fix Version/s: (was: 1.1) - pushing this out per http://bit.ly/c7tBv9 RDF parser plugin - Key: NUTCH-460 URL: https://issues.apache.org/jira/browse/NUTCH-460 Project: Nutch Issue Type: New Feature Components: fetcher Affects Versions: 0.9.0 Reporter: Ricardo J. Méndez Attachments: rubyspider-rdf.zip I've written a couple plugins that I'd like to contribute. RDFLinkParseFilter looks for links on the pages that point towards RDF information, and tags the pages with metadata about the type of links they hold. RDFLinkIndexingFilter indexes said metadata. RDFParser parses RDF information from several possible formats using Jena, and extracts the links that the file points to as Outlinks so that they can be fetched as well. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-460) RDF parser plugin
[ https://issues.apache.org/jira/browse/NUTCH-460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-460: Patch Info: [Patch Available] - pushing this out per http://bit.ly/c7tBv9 RDF parser plugin - Key: NUTCH-460 URL: https://issues.apache.org/jira/browse/NUTCH-460 Project: Nutch Issue Type: New Feature Components: fetcher Affects Versions: 0.9.0 Reporter: Ricardo J. Méndez Attachments: rubyspider-rdf.zip I've written a couple plugins that I'd like to contribute. RDFLinkParseFilter looks for links on the pages that point towards RDF information, and tags the pages with metadata about the type of links they hold. RDFLinkIndexingFilter indexes said metadata. RDFParser parses RDF information from several possible formats using Jena, and extracts the links that the file points to as Outlinks so that they can be fetched as well. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-729) NPE in FieldIndexer when BasicFields url doesn't exist
[ https://issues.apache.org/jira/browse/NUTCH-729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-729: Due Date: 26/Mar/09 (was: 26/Mar/09) Patch Info: [Patch Available] Fix Version/s: (was: 1.1) - pushing this out per http://bit.ly/c7tBv9 NPE in FieldIndexer when BasicFields url doesn't exist -- Key: NUTCH-729 URL: https://issues.apache.org/jira/browse/NUTCH-729 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 0.9.0, 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Attachments: NUTCH-729-1-20090235.patch There is a NullPointerException during a logging call in FieldIndexer when there isn't a url for a document. Documents shouldn't be without urls but since the FieldIndexer doesn't validate fields it is possible for it to occur. Most often this happens when BasicFields is run with the wrong segments directory and doesn't complain. It could also occur if using the FieldIndexer to index things other than basic fields. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-573) Multiple Domains - Query Search
[ https://issues.apache.org/jira/browse/NUTCH-573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-573: - pushing this out per http://bit.ly/c7tBv9 Multiple Domains - Query Search --- Key: NUTCH-573 URL: https://issues.apache.org/jira/browse/NUTCH-573 Project: Nutch Issue Type: Improvement Components: searcher Affects Versions: 0.9.0 Environment: All Reporter: Rajasekar Karthik Assignee: Enis Soztutar Attachments: multiTermQuery_v1.patch Searching multiple domains can be done on Lucene, but not that efficiently on Nutch. Query: +content:abc +(site:www.aaa.com site:www.bbb.com) works on Lucene but the same concept does not work on Nutch. In Lucene, it works with org.apache.lucene.analysis.KeywordAnalyzer and org.apache.lucene.analysis.standard.StandardAnalyzer but NOT with org.apache.lucene.analysis.SimpleAnalyzer. Is the Nutch analyzer based on SimpleAnalyzer? In this case, is there a workaround to make this work? Is there an option to change what analyzer Nutch is using? Just FYI, another solution (inefficient, I believe) which seems to be working on Nutch is the query -site:ccc.com -site:ddd.com -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-717) Make Nutch Solr integration easier
[ https://issues.apache.org/jira/browse/NUTCH-717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-717: Fix Version/s: (was: 1.1) - pushing this out per http://bit.ly/c7tBv9 Make Nutch Solr integration easier -- Key: NUTCH-717 URL: https://issues.apache.org/jira/browse/NUTCH-717 Project: Nutch Issue Type: New Feature Reporter: Sami Siren Erik Hatcher proposed we should provide a full Solr config dir to be used with Nutch-Solr. Currently we only provide the index schema. It would be considerably easier to set up Nutch-Solr if we provided the whole conf dir that you could use with Solr like: java -Dsolr.solr.home=<Nutch's Solr Home> -jar start.jar -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-541) Index url field untokenized
[ https://issues.apache.org/jira/browse/NUTCH-541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-541: Fix Version/s: (was: 1.1) - pushing this out per http://bit.ly/c7tBv9 Index url field untokenized --- Key: NUTCH-541 URL: https://issues.apache.org/jira/browse/NUTCH-541 Project: Nutch Issue Type: New Feature Components: indexer, searcher Affects Versions: 1.0.0 Reporter: Enis Soztutar Assignee: Enis Soztutar The url field is indexed as Store.YES, Index.TOKENIZED. We also need the untokenized version of the url field in some contexts: 1. For deleting duplicates by url (at search time), see NUTCH-455. 2. For restricting the search to a certain url (may be used in the case of RSS search where each entry in the RSS is added as a distinct document with (possibly) the same url). query-url extends FieldQueryFilter so: Query: url:http://www.apache.org/ Parsed: url:http http-www http-www-apache www www-apache apache org Translated: +url:http-http-www http-www-http-www-apache http-www-apache-www www-www-apache www-apache apache org 3. For accessing a document (or documents) in the search servers (using a query plugin). I suggest we add url as in index-basic and implement a query-url-untoken plugin. doc.add(new Field("url", url.toString(), Field.Store.YES, Field.Index.TOKENIZED)); doc.add(new Field("url_untoken", url.toString(), Field.Store.NO, Field.Index.UN_TOKENIZED)); -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
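To illustrate why a tokenized url field alone is a poor key for exact lookups, here is a small sketch in plain Java. It does not use Lucene or Nutch's analyzer; the class UrlFieldSketch and its crude tokenizer (lowercase, split on non-alphanumerics) are assumptions that only approximate the behaviour described in NUTCH-541.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Hypothetical illustration, not Nutch code. */
final class UrlFieldSketch {

    private UrlFieldSketch() {}

    /** Crude stand-in for a tokenizing analyzer: lowercase, split on non-alphanumerics. */
    static List<String> tokenize(String url) {
        List<String> tokens = new ArrayList<>();
        for (String t : url.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** True when both URLs produce the same tokens, ignoring order. */
    static boolean sameTokenBag(String a, String b) {
        List<String> ta = tokenize(a);
        List<String> tb = tokenize(b);
        Collections.sort(ta);
        Collections.sort(tb);
        return ta.equals(tb);
    }
}
```

With this tokenizer, http://www.apache.org/ and http://apache.org/www yield the same set of terms even though they are different URLs, which is the kind of ambiguity an untokenized copy of the field avoids when deleting duplicates or restricting a search to one url.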
[jira] Updated: (NUTCH-628) Host database to keep track of host-level information
[ https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-628: Patch Info: [Patch Available] Fix Version/s: (was: 1.1) - pushing this out per http://bit.ly/c7tBv9 Host database to keep track of host-level information - Key: NUTCH-628 URL: https://issues.apache.org/jira/browse/NUTCH-628 Project: Nutch Issue Type: New Feature Components: fetcher, generator Reporter: Otis Gospodnetic Attachments: domain_statistics_v2.patch, NUTCH-628-DomainStatistics.patch, NUTCH-628-HostDb.patch Nutch would benefit from having a DB with per-host/domain/TLD information. For instance, Nutch could detect hosts that are timing out, store information about that in this DB. Segment/fetchlist Generator could then skip such hosts, so they don't slow down the fetch job. Another good use for such a DB is keeping track of various host scores, e.g. spam score. From the recent thread on nutch-u...@lucene: Otis asked: While we are at it, how would one go about implementing this DB, as far as its structures go? Andrzej said: The easiest I can imagine is to use something like Text, MapWritable. This way you could store arbitrary information under arbitrary keys. I.e. a single database then could keep track of aggregate statistics at different levels, e.g. TLD, domain, host, ip range, etc. The basic set of statistics could consist of a few predefined gauges, totals and averages. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
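The structure Andrzej describes (arbitrary statistics stored under arbitrary keys) can be sketched with ordinary Java maps. This is only an in-memory illustration: HostDbSketch, the gauge name "timeouts", and the skip threshold are assumptions, and plain maps stand in for the Hadoop Text and MapWritable types mentioned in the thread.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for a host database; not part of Nutch. */
final class HostDbSketch {

    /** Key is a host, domain, or TLD; value maps a gauge name to its running total. */
    private final Map<String, Map<String, Long>> stats = new HashMap<>();

    /** Adds delta to the named gauge for the given key, creating entries as needed. */
    void increment(String key, String gauge, long delta) {
        stats.computeIfAbsent(key, k -> new HashMap<>()).merge(gauge, delta, Long::sum);
    }

    /** Returns the current value of a gauge, or 0 if nothing has been recorded. */
    long get(String key, String gauge) {
        Map<String, Long> gauges = stats.get(key);
        return gauges == null ? 0L : gauges.getOrDefault(gauge, 0L);
    }

    /** Assumed policy: a generator skips hosts with at least maxTimeouts recorded timeouts. */
    boolean shouldSkip(String host, long maxTimeouts) {
        return get(host, "timeouts") >= maxTimeouts;
    }
}
```

Because the key is just a string, the same map can hold aggregates at host, domain, and TLD level side by side.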
[jira] Updated: (NUTCH-650) Hbase Integration
[ https://issues.apache.org/jira/browse/NUTCH-650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-650: Fix Version/s: (was: 1.1) - pushing this out per http://bit.ly/c7tBv9 Hbase Integration - Key: NUTCH-650 URL: https://issues.apache.org/jira/browse/NUTCH-650 Project: Nutch Issue Type: New Feature Affects Versions: 1.0.0 Reporter: Doğacan Güney Assignee: Doğacan Güney Attachments: hbase-integration_v1.patch, hbase_v2.patch, malformedurl.patch, meta.patch, meta2.patch, nofollow-hbase.patch, NUTCH-650.patch, nutch-habase.patch, searching.diff, slash.patch This issue will track nutch/hbase integration -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-583) FeedParser empty links for items
[ https://issues.apache.org/jira/browse/NUTCH-583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-583: Fix Version/s: (was: 1.1) - pushing this out per http://bit.ly/c7tBv9 FeedParser empty links for items Key: NUTCH-583 URL: https://issues.apache.org/jira/browse/NUTCH-583 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Reporter: Enis Soztutar Assignee: Enis Soztutar FeedParser in the feed plugin just discards an item if it does not have a link element. However, RSS 2.0 does not require the link element for each item. Moreover, sometimes the link is given in the guid element, which is a globally unique identifier for the item. I think we can search for the url of an item first, then, if it is still not found, we can use the feed's url, but with merging all the parse texts into one Parse object. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
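The fallback order suggested in NUTCH-583 (item link, then guid, then the feed's own url) might be sketched as follows. This is not the FeedParser implementation: FeedItemUrlSketch and its methods are hypothetical, and treating a guid as usable only when it starts with http:// or https:// is an assumption made for the example.

```java
/** Hypothetical helper, not part of the Nutch feed plugin. */
final class FeedItemUrlSketch {

    private FeedItemUrlSketch() {}

    /** Picks the item's url: link first, then a url-like guid, then the feed url. */
    static String resolveUrl(String link, String guid, String feedUrl) {
        if (isPresent(link)) {
            return link.trim();
        }
        if (isPresent(guid) && looksLikeUrl(guid.trim())) {
            return guid.trim();
        }
        return feedUrl;
    }

    private static boolean isPresent(String s) {
        return s != null && !s.trim().isEmpty();
    }

    /** Assumption: only http(s) guids are treated as fetchable urls. */
    private static boolean looksLikeUrl(String s) {
        return s.startsWith("http://") || s.startsWith("https://");
    }
}
```

Items that fall through to the feed url would all share one key, which is why the issue proposes merging their parse texts into a single Parse object.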
[jira] Updated: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool
[ https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-666: Due Date: 27/Nov/08 (was: 27/Nov/08) Fix Version/s: (was: 1.1) - pushing this out per http://bit.ly/c7tBv9 Analysis plugins for multiple language and new Language Identifier Tool --- Key: NUTCH-666 URL: https://issues.apache.org/jira/browse/NUTCH-666 Project: Nutch Issue Type: Improvement Affects Versions: 1.1 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Attachments: NUTCH-666-1-20081126.patch, NUTCH-666-2-20091217-nf.patch Add analysis plugins for Czech, Greek, Japanese, Chinese, Korean, Dutch, Russian, and Thai. Also includes a new Language Identifier tool that uses the new indexing framework in NUTCH-646. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool
[ https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-666: Patch Info: [Patch Available] Analysis plugins for multiple language and new Language Identifier Tool --- Key: NUTCH-666 URL: https://issues.apache.org/jira/browse/NUTCH-666 Project: Nutch Issue Type: Improvement Affects Versions: 1.1 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Attachments: NUTCH-666-1-20081126.patch, NUTCH-666-2-20091217-nf.patch Add analysis plugins for Czech, Greek, Japanese, Chinese, Korean, Dutch, Russian, and Thai. Also includes a new Language Identifier tool that uses the new indexing framework in NUTCH-646. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-475) Adaptive crawl delay
[ https://issues.apache.org/jira/browse/NUTCH-475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-475: Fix Version/s: (was: 1.1) - pushing this out per http://bit.ly/c7tBv9 Adaptive crawl delay Key: NUTCH-475 URL: https://issues.apache.org/jira/browse/NUTCH-475 Project: Nutch Issue Type: Improvement Components: fetcher Reporter: Doğacan Güney Attachments: adaptive-delay_draft.patch The current fetcher implementation waits a default interval before making another request to the same server (if crawl-delay is not specified in robots.txt). IMHO, an adaptive implementation would be better. If the server is under little load and can serve requests fast, then the fetcher can ask for more pages in a given interval. Similarly, if the server is suffering from heavy load, the fetcher can slow down (w.r.t. that host), easing the load on the server. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
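One simple way to make the delay adaptive, as NUTCH-475 proposes, is to scale it with the last observed response time and clamp the result. The sketch below is an assumed policy for illustration only; it is not taken from adaptive-delay_draft.patch, and AdaptiveDelaySketch, the multiplier, and the bounds are hypothetical.

```java
/** Hypothetical policy sketch, not the Nutch fetcher. */
final class AdaptiveDelaySketch {

    private AdaptiveDelaySketch() {}

    /**
     * Computes the wait before the next request to a host.
     * The delay is multiplier times the last response time, kept within
     * [minDelayMillis, maxDelayMillis] so fast hosts are not hammered and
     * slow hosts are not abandoned.
     */
    static long nextDelayMillis(long lastResponseMillis, long multiplier,
                                long minDelayMillis, long maxDelayMillis) {
        long proposed = lastResponseMillis * multiplier;
        return Math.max(minDelayMillis, Math.min(maxDelayMillis, proposed));
    }
}
```

A host answering in 100 ms with a multiplier of 5 would be revisited after 500 ms, while one taking 4 s would be capped at the configured maximum.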
[jira] Updated: (NUTCH-771) Add WebGraph classes to the bin/nutch script
[ https://issues.apache.org/jira/browse/NUTCH-771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-771: Fix Version/s: (was: 1.1) - pushing this out per http://bit.ly/c7tBv9 Add WebGraph classes to the bin/nutch script Key: NUTCH-771 URL: https://issues.apache.org/jira/browse/NUTCH-771 Project: Nutch Issue Type: Improvement Affects Versions: 1.1 Environment: All, shell script Reporter: Dennis Kubes Assignee: Dennis Kubes Currently the webgraph jobs are called on the command line by calling main methods on their classes. I propose to upgrade the bin/nutch shell script to allow calling these jobs as well. This would include the webgraphdb, linkrank, scoreupdater, and nodedumper jobs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-673) Upgrade the Carrot2 plug-in to release 3.0
[ https://issues.apache.org/jira/browse/NUTCH-673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12852047#action_12852047 ] Chris A. Mattmann commented on NUTCH-673: - Folks: if you get time to put together a patch for 1.1 or feel that this should go into 1.1, please see: http://bit.ly/c7tBv9 and comment in the next 48 hrs... Upgrade the Carrot2 plug-in to release 3.0 -- Key: NUTCH-673 URL: https://issues.apache.org/jira/browse/NUTCH-673 Project: Nutch Issue Type: Improvement Components: web gui Affects Versions: 0.9.0 Environment: All Nutch deployments. Reporter: Sean Dean Priority: Minor Release 3.0 of the Carrot2 plug-in was released recently. We currently have version 2.1 in the source tree, and upgrading it to the latest version before 1.0-release might make sense. Details on the release can be found here: http://project.carrot2.org/release-3.0-notes.html One major change in requirements is that JDK 1.5 must be used, but this is also now required for Hadoop 0.19, so this wouldn't be the only reason for the switch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-789) Improvements to Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12852048#action_12852048 ] Chris A. Mattmann commented on NUTCH-789: - Folks, I'm going to put together an RC for Tika 0.7 and take care of JIRA now. Once I do that, we can try and close out this issue for 1.1. I should be able to do this before the 48 hr deadline I threw up for Nutch 1.1... Improvements to Tika parser --- Key: NUTCH-789 URL: https://issues.apache.org/jira/browse/NUTCH-789 Project: Nutch Issue Type: Improvement Components: fetcher Environment: reported by Sami, in NUTCH-766 Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Priority: Minor Fix For: 1.1 Attachments: NutchTikaConfig.java, TikaParser.java As reported by Sami in NUTCH-766, Sami has a few improvements he made to the Tika parser. We'll track that progress here. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-794) Language Identification must use check the parse metadata for language values
[ https://issues.apache.org/jira/browse/NUTCH-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12852101#action_12852101 ] Chris A. Mattmann commented on NUTCH-794: - Hey Julien, yepper, I posted an RC of Tika 0.7, see: http://bit.ly/c7FZRc. If the VOTE passes on that in say the next 72 hours, I will push out a Tika 0.7 release to the mirrors. If everyone is OK with that, we can release Nutch 1.1 after...thoughts? Language Identification must use check the parse metadata for language values -- Key: NUTCH-794 URL: https://issues.apache.org/jira/browse/NUTCH-794 Project: Nutch Issue Type: Bug Components: parser Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.1 Attachments: NUTCH-794.patch The following HTML document : <html lang="fi"><head>document 1 title</head><body>jotain suomeksi</body></html> is rendered as the following xhtml by Tika : <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml"><head><title/></head><body>document 1 titlejotain suomeksi</body></html> with the lang attribute getting lost. The lang is not stored in the metadata either. I will open an issue on Tika and modify TestHTMLLanguageParser so that the tests don't break anymore -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
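For readers who want to see what "the lang attribute getting lost" means in practice, the sketch below pulls the lang attribute out of the raw HTML with a regular expression and finds nothing in the Tika-style xhtml. It is a rough illustration only: HtmlLangSketch is hypothetical, a regex is not a robust HTML parser, and this is not how Nutch's language identifier or the attached patch works.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration, not part of Nutch or Tika. */
final class HtmlLangSketch {

    /** Matches a lang attribute on the opening html tag, with or without quotes. */
    private static final Pattern HTML_LANG = Pattern.compile(
            "<html\\b[^>]*?\\blang\\s*=\\s*[\"']?([A-Za-z-]+)",
            Pattern.CASE_INSENSITIVE);

    private HtmlLangSketch() {}

    /** Returns the declared language code, or empty if the html tag has none. */
    static Optional<String> declaredLanguage(String markup) {
        Matcher m = HTML_LANG.matcher(markup);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```

Applied to the original document this yields "fi"; applied to the xhtml output quoted in the issue it yields an empty result, so any language check that runs after parsing has to get the value from somewhere else, such as the parse metadata.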