[jira] [Updated] (NUTCH-897) Subcollection requires blacklist element
[ https://issues.apache.org/jira/browse/NUTCH-897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-897:
    Attachment: NUTCH-897.patch

Attached a tested fix, confirmed to work and not to break existing configurations. The patch works for 1.3 and trunk.

Subcollection requires blacklist element
    Key: NUTCH-897
    URL: https://issues.apache.org/jira/browse/NUTCH-897
    Project: Nutch
    Issue Type: Bug
    Components: indexer
    Affects Versions: 1.2, 1.3, 2.0
    Reporter: Markus Jelsma
    Assignee: Markus Jelsma
    Priority: Trivial
    Fix For: 1.3, 2.0
    Attachments: NUTCH-897.patch

This is a very minor issue in Subcollection.java. It throws an error if the (empty) blacklist element was omitted. I think it should either not silently fail in case of an omitted blacklist element or throw a decent error message that the blacklist element is required.

The following exception gets thrown if the blacklist element is omitted in a subcollection block:

    2010-09-06 13:32:30,438 INFO collection.CollectionManager - Instantiating CollectionManager
    2010-09-06 13:32:30,438 INFO collection.CollectionManager - initializing CollectionManager
    2010-09-06 13:32:30,451 INFO collection.CollectionManager - file has1 elements
    2010-09-06 13:32:30,456 WARN collection.CollectionManager - Error occured:java.lang.NullPointerException
    2010-09-06 13:32:30,469 WARN collection.CollectionManager - java.lang.NullPointerException
    2010-09-06 13:32:30,470 WARN collection.CollectionManager - at org.apache.nutch.collection.Subcollection.initialize(Subcollection.java:173)
    2010-09-06 13:32:30,470 WARN collection.CollectionManager - at org.apache.nutch.collection.CollectionManager.parse(CollectionManager.java:98)
    2010-09-06 13:32:30,470 WARN collection.CollectionManager - at org.apache.nutch.collection.CollectionManager.init(CollectionManager.java:75)
    2010-09-06 13:32:30,470 WARN collection.CollectionManager - at org.apache.nutch.collection.CollectionManager.init(CollectionManager.java:56)
    2010-09-06 13:32:30,471 WARN collection.CollectionManager - at org.apache.nutch.collection.CollectionManager.getCollectionManager(CollectionManager.java:115)
    2010-09-06 13:32:30,471 WARN collection.CollectionManager - at org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter.addSubCollectionField(SubcollectionIndexingFilter.java:65)
    2010-09-06 13:32:30,471 WARN collection.CollectionManager - at org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter.filter(SubcollectionIndexingFilter.java:71)
    2010-09-06 13:32:30,471 WARN collection.CollectionManager - at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:109)
    2010-09-06 13:32:30,471 WARN collection.CollectionManager - at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:134)
    2010-09-06 13:32:30,472 WARN collection.CollectionManager - at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50)
    2010-09-06 13:32:30,472 WARN collection.CollectionManager - at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
    2010-09-06 13:32:30,472 WARN collection.CollectionManager - at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
    2010-09-06 13:32:30,472 WARN collection.CollectionManager - at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
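To illustrate the "treat an omitted element like an empty one" option from NUTCH-897 above, here is a minimal sketch using the JDK's DOM parser. It is not the code in Subcollection.java and not the attached patch; the class and method names are hypothetical, and the `whitelist` element in the usage note is an assumption used only to make the sample XML look plausible.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; not part of Nutch.
class OptionalElementSketch {

  /** Returns the trimmed text of the first descendant element named tag, or "" if none exists. */
  static String textOrEmpty(Element parent, String tag) {
    NodeList nodes = parent.getElementsByTagName(tag);
    if (nodes.getLength() == 0) {
      return ""; // element omitted: handle it like an empty element instead of dereferencing null
    }
    String text = nodes.item(0).getTextContent();
    return text == null ? "" : text.trim();
  }

  /** Parses an XML string and returns its root element; wraps checked exceptions. */
  static Element parseRoot(String xml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
      return doc.getDocumentElement();
    } catch (Exception e) {
      throw new IllegalArgumentException("could not parse XML", e);
    }
  }
}
```

With this helper, `textOrEmpty(parseRoot("<subcollection><whitelist>a.example</whitelist></subcollection>"), "blacklist")` yields an empty string rather than a NullPointerException. The alternative the report mentions, a clear error message, would replace the `return ""` with a thrown exception naming the missing element.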
Re: Clean up open legacy issues in Jira
Super +1 Markus -- I've tried over the past 9 months to do this periodically when I've rolled releases, but if everyone could take a look and close out really old or non-applicable bugs, that would be great! BTW, time is freeing up for me lately, so it might be time finally for the 1.3 release, if folks are cool with me RM'ing it :) Cheers, Chris On Apr 1, 2011, at 7:03 AM, Markus Jelsma wrote: Hi guys, There's an awful lot of legacy in Jira. I propose we close the bulk of the issues that deal with the old search server, very old plugins or really old code. Thoughts? Cheers, ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
[jira] [Closed] (NUTCH-973) Remove Segment Merger in 1.3
[ https://issues.apache.org/jira/browse/NUTCH-973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-973. --- Resolution: Not A Problem You are right, let's leave it for now. It won't be a problem once we're on 2.0 anyway Remove Segment Merger in 1.3 Key: NUTCH-973 URL: https://issues.apache.org/jira/browse/NUTCH-973 Project: Nutch Issue Type: Task Reporter: Julien Nioche Priority: Minor Fix For: 1.3 The code for the segment merging is still in 1.3, as far as I understand its original function it was mostly useful for having a single data structure where the search app could get the cached data from. Now that we've delegated the indexing and search to SOLR we don't really need to worry about the cache anymore. Would it make sense to purge it or do you guys think it would still be useful? -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Closed] (NUTCH-39) pagination in search result
[ https://issues.apache.org/jira/browse/NUTCH-39?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma closed NUTCH-39.
    Resolution: Won't Fix

Bulk close of legacy issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

pagination in search result
    Key: NUTCH-39
    URL: https://issues.apache.org/jira/browse/NUTCH-39
    Project: Nutch
    Issue Type: Improvement
    Components: web gui
    Environment: all
    Reporter: Jack Tang
    Priority: Trivial

Currently, in Nutch's search.jsp, the user navigates through all search results with the Next button. Google-like pagination would feel better.
[jira] [Closed] (NUTCH-36) Chinese in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-36?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma closed NUTCH-36.
    Resolution: Won't Fix

Bulk close of legacy issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

Chinese in Nutch
    Key: NUTCH-36
    URL: https://issues.apache.org/jira/browse/NUTCH-36
    Project: Nutch
    Issue Type: Improvement
    Components: indexer, searcher
    Environment: all
    Reporter: Jack Tang
    Priority: Minor
    Attachments: #26700

Nutch currently supports Chinese in a very simple way: NutchAnalysis segments CJK terms word by word. So if I search for the Chinese term 'FooBar' (two Chinese words, 'Foo' and 'Bar'), the results in the web GUI highlight 'FooBar' as well as 'Foo' and 'Bar', whereas we expect Nutch to highlight only 'FooBar'.
[jira] [Closed] (NUTCH-13) If dns points to 127.0.0.1, the url is also crawled
[ https://issues.apache.org/jira/browse/NUTCH-13?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-13. -- Resolution: Won't Fix Bulk close of legacy issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira If dns points to 127.0.0.1, the url is also crawled --- Key: NUTCH-13 URL: https://issues.apache.org/jira/browse/NUTCH-13 Project: Nutch Issue Type: Bug Components: fetcher Reporter: Matthias Jaekle Priority: Minor For example www.tik24.de points to 127.0.0.1. If you follow a link to www.tik24.de fetcher will crawl content from your own machine. Wrong DNS entries could create unwanted entries in segments. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
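The situation described in NUTCH-13 above (a hostname whose DNS entry points at 127.0.0.1) can be detected with the JDK alone. The following is a minimal sketch, not Nutch's fetcher code; the class and method names are hypothetical, and whether a crawler should skip such hosts is a policy decision the sketch does not make.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper for illustration; not part of Nutch.
class LoopbackCheckSketch {

  /** True for loopback addresses (127.0.0.0/8, ::1) and the wildcard address (0.0.0.0, ::). */
  static boolean isLocalOnly(InetAddress addr) {
    return addr.isLoopbackAddress() || addr.isAnyLocalAddress();
  }

  /**
   * Resolves the host and reports whether any of its addresses is local-only.
   * IP literals are checked without a DNS lookup. Unresolvable hosts return false.
   */
  static boolean resolvesToLoopback(String host) {
    try {
      for (InetAddress addr : InetAddress.getAllByName(host)) {
        if (isLocalOnly(addr)) {
          return true;
        }
      }
      return false;
    } catch (UnknownHostException e) {
      return false; // nothing to fetch from, so not a loopback risk
    }
  }
}
```

A fetcher could call `resolvesToLoopback(host)` before connecting and drop or log the URL when it returns true, which would keep content from the crawling machine itself out of the segments.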
[jira] [Closed] (NUTCH-79) Fault tolerant searching.
[ https://issues.apache.org/jira/browse/NUTCH-79?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma closed NUTCH-79.
    Resolution: Won't Fix

Bulk close of legacy issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

Fault tolerant searching.
    Key: NUTCH-79
    URL: https://issues.apache.org/jira/browse/NUTCH-79
    Project: Nutch
    Issue Type: New Feature
    Components: searcher
    Reporter: Piotr Kosiorowski
    Attachments: patch

I have finally managed to prepare a first version of the fault-tolerant searching I promised a long time ago. It reads the server configuration from a search-groups.txt file (in the startup directory or the directory specified by searcher.dir) if no search-servers.txt file is present. If search-servers.txt is present, it is read and handled as before.

Format of search-groups.txt:

    search.group.count=[int]
    search.group.name.[i]=[string] (for i=0 to count-1)

    For each name:
    [name].part.count=[int] partitionCount
    [name].part.[i].host=[string] (for i=0 to partitionCount-1)
    [name].part.[i].port=int (for i=0 to partitionCount-1)

Example:

    search.group.count=2
    search.group.name.0=master
    search.group.name.1=backup

    master.part.count=2
    master.part.0.host=host1
    master.part.0.port=
    master.part.1.host=host2
    master.part.1.port=

    backup.part.count=2
    backup.part.0.host=host3
    backup.part.0.port=
    backup.part.1.host=host4
    backup.part.1.port=

If more than one search group is defined in the configuration file, requests are distributed among the groups in round-robin fashion. If one of the servers in a group fails to respond, the whole group is treated as inactive and removed from the pool used to distribute requests. A separate recovery thread checks every searcher.recovery.delay seconds (default 60) whether an inactive group has come back, and if so adds it back to the pool of active groups.
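The search-groups.txt layout and the round-robin behaviour described in NUTCH-79 above can be sketched with `java.util.Properties`. This is an illustration under stated assumptions, not the attached patch: the class names are hypothetical, the port numbers in the usage note are made up (the example in the report leaves them blank), and the recovery thread is reduced to a `markActive` call.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

// Hypothetical parser for the key layout described in the report.
class SearchGroupsSketch {

  /** Returns group name -> list of "host:port" strings, in file order. */
  static Map<String, List<String>> parse(String text) {
    Properties p = new Properties();
    try {
      p.load(new StringReader(text));
    } catch (IOException e) {
      throw new IllegalArgumentException(e);
    }
    Map<String, List<String>> groups = new LinkedHashMap<>();
    int groupCount = Integer.parseInt(p.getProperty("search.group.count").trim());
    for (int g = 0; g < groupCount; g++) {
      String name = p.getProperty("search.group.name." + g).trim();
      int parts = Integer.parseInt(p.getProperty(name + ".part.count").trim());
      List<String> servers = new ArrayList<>();
      for (int i = 0; i < parts; i++) {
        String host = p.getProperty(name + ".part." + i + ".host").trim();
        String port = p.getProperty(name + ".part." + i + ".port").trim();
        servers.add(host + ":" + port);
      }
      groups.put(name, servers);
    }
    return groups;
  }
}

// Hypothetical pool: hands out active groups in turn; failed groups are skipped until re-added.
class RoundRobinGroups {
  private final List<String> active = new ArrayList<>();
  private int next = 0;

  RoundRobinGroups(Iterable<String> names) {
    for (String n : names) {
      active.add(n);
    }
  }

  synchronized String pick() {
    if (active.isEmpty()) {
      throw new IllegalStateException("no active search groups");
    }
    String group = active.get(next % active.size());
    next = (next + 1) % active.size();
    return group;
  }

  /** Called when any server in the group fails to respond. */
  synchronized void markInactive(String name) {
    active.remove(name);
  }

  /** Called by a recovery check once the group answers again. */
  synchronized void markActive(String name) {
    if (!active.contains(name)) {
      active.add(name);
    }
  }
}
```

Given a file with groups `master` and `backup` (and, say, hypothetical ports 9001 to 9003), `pick()` alternates between the two names; after `markInactive("master")` it returns only `backup` until `markActive("master")` is called.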
[jira] [Closed] (NUTCH-103) Vivisimo like treeview and url redirect
[ https://issues.apache.org/jira/browse/NUTCH-103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-103. --- Resolution: Won't Fix Bulk close of legacy issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira Vivisimo like treeview and url redirect --- Key: NUTCH-103 URL: https://issues.apache.org/jira/browse/NUTCH-103 Project: Nutch Issue Type: Improvement Components: web gui Affects Versions: 0.8 Environment: linux Reporter: robert benea Priority: Trivial Attachments: clusty.tar First, I modified cluster.jsp and now the cluster has a vivisimo look. I used javascript to show the treeview. Another small change is that I call the cluster recursively twice, so that two levels of clustering are shown. Second, I added redirect.jsp in order to log the links that were clicked during search and because of that search.jsp is changed as well. The code is not clean as all started as an experiment, I hope someone else finds it useful and clean it up ;-). To install it just copy the files where you deployed the nutch.war and will work auto-magically. Regards, R. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-18) Windows servers include illegal characters in URLs
[ https://issues.apache.org/jira/browse/NUTCH-18?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13014581#comment-13014581 ]

David Escuer commented on NUTCH-18:

The person you are trying to reach will be out of the SIMPPLE offices from March 30 until April 7, both inclusive.

Windows servers include illegal characters in URLs
    Key: NUTCH-18
    URL: https://issues.apache.org/jira/browse/NUTCH-18
    Project: Nutch
    Issue Type: Bug
    Components: fetcher
    Reporter: Stefan Groschupf
    Priority: Minor

Transferred from: http://sourceforge.net/tracker/index.php?func=detail&aid=1110243&group_id=59548&atid=491356
Submitted by: Ken Meltsner

While spidering our intranet, I found that IIS may include illegal characters in URLs, specifically characters with the high bit set to produce non-English letters. In addition, both Firefox and IE will accept URLs with high-bit characters, but Java won't. While this may not be Nutch's (or Java's) fault, it would help if high-bit characters (and other illegal characters) in URLs could be escaped (using percent-hex notation) as part of the URL fix-up process, probably right after the hostname lower-case conversion.

Example document name in Portuguese (with high-bit characters), taken from a longer URL:

    Nota%20tecnica%20-%20Alteração%20de%20escopo.doc

and with percent-escaped characters:

    Nota%20tecnica%20-%20Altera%e7%e3o%20de%20escopo.doc
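The percent-hex escaping requested in NUTCH-18 above can be sketched as follows. This is an illustration, not Nutch's URL normalizer; the class and method names are hypothetical. The example in the report corresponds to ISO-8859-1 bytes (%e7, %e3), so the charset is a parameter here; choosing UTF-8 instead is an equally plausible assumption and gives different escapes.

```java
import java.nio.charset.Charset;

// Hypothetical helper for illustration; not part of Nutch.
class HighBitEscapeSketch {

  /**
   * Percent-escapes every non-ASCII character using the bytes of the given charset.
   * ASCII characters, including existing %xx escapes, are copied unchanged.
   * Characters the charset cannot encode come out as the charset's replacement byte.
   */
  static String escapeHighBit(String url, Charset charset) {
    StringBuilder out = new StringBuilder(url.length());
    int i = 0;
    while (i < url.length()) {
      int codePoint = url.codePointAt(i);
      int len = Character.charCount(codePoint);
      if (codePoint < 0x80) {
        out.append((char) codePoint);
      } else {
        for (byte b : url.substring(i, i + len).getBytes(charset)) {
          int v = b & 0xff;
          out.append('%')
              .append(Character.forDigit(v >> 4, 16))   // high nibble, lower-case hex
              .append(Character.forDigit(v & 0xf, 16)); // low nibble
        }
      }
      i += len;
    }
    return out.toString();
  }
}
```

Called with `StandardCharsets.ISO_8859_1` on the Portuguese document name from the report, this produces the escaped form shown there; with `StandardCharsets.UTF_8` each accented letter becomes two escaped bytes.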
[jira] [Closed] (NUTCH-104) Nutch query parser does not support CJK bi-gram segmentation.
[ https://issues.apache.org/jira/browse/NUTCH-104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma closed NUTCH-104.
    Resolution: Won't Fix

Bulk close of legacy issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

Nutch query parser does not support CJK bi-gram segmentation.
    Key: NUTCH-104
    URL: https://issues.apache.org/jira/browse/NUTCH-104
    Project: Nutch
    Issue Type: Bug
    Components: searcher
    Affects Versions: 0.6
    Environment: all
    Reporter: Jack Tang
    Priority: Minor

I customized a query filter that uses test as its field. When I search for test:(c1)(c2)(c3), the query object generated by NutchAnalysis is wrong: the result is test:(c1)(c2) [DEFAULT](c2)(c3), whereas the expected result is test:(c1)(c2) (c2)(c3).
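For readers unfamiliar with the term in NUTCH-104 above, bi-gram segmentation turns a run of CJK characters c1 c2 c3 into the overlapping pairs c1c2 and c2c3. A minimal sketch follows; it is not NutchAnalysis, and the class and method names are hypothetical. The point of the report is that every pair should stay in the field the user named rather than falling back to the default field; the sketch only shows the segmentation step.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Nutch.
class BigramSketch {

  /**
   * Splits a run of characters into overlapping two-character terms.
   * A single character is returned as-is; an empty run gives an empty list.
   * Works on code points, so characters outside the BMP are not split.
   */
  static List<String> bigrams(String run) {
    int[] codePoints = run.codePoints().toArray();
    List<String> out = new ArrayList<>();
    if (codePoints.length == 1) {
      out.add(run);
      return out;
    }
    for (int i = 0; i + 1 < codePoints.length; i++) {
      out.add(new String(codePoints, i, 2));
    }
    return out;
  }
}
```

A field-aware query parser would then attach the same field name to each element of the returned list.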
[jira] [Closed] (NUTCH-180) Performance problem with widely used keywords
[ https://issues.apache.org/jira/browse/NUTCH-180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-180. --- Resolution: Won't Fix Bulk close of legacy issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira Performance problem with widely used keywords - Key: NUTCH-180 URL: https://issues.apache.org/jira/browse/NUTCH-180 Project: Nutch Issue Type: Wish Components: searcher Reporter: Mike Alulin It looks like Nutch is very slow when the search phrase includes a few widely used keywords. For example I 1 2 3 4 5 6 7 8 9 0 typed without the quotes to Yahoo, Google, or MSN is processed in less than a second. Nutch on the other hand requires much more time for this even on smaller databases. For example this phrase made objectssearch.com think more than 1 minute although their DB is much smaller than DBs of the big 3 guys. On my test Nutch DB with only 3M pages this phrase took a few seconds to process. Unfortunately I do not know much about search algorithms, but it looks like Nutch do have some space to improve the search performance. The current implementation can be easily killed by a few search requests like this. Just a couple of dozen of such requests makes my server with 2 Opterons think for a minute or two with 100% CPU utilization. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Closed] (NUTCH-581) DistributedSearch does not update search servers added to search-servers.txt on the fly
[ https://issues.apache.org/jira/browse/NUTCH-581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma closed NUTCH-581.

DistributedSearch does not update search servers added to search-servers.txt on the fly
    Key: NUTCH-581
    URL: https://issues.apache.org/jira/browse/NUTCH-581
    Project: Nutch
    Issue Type: Improvement
    Components: searcher
    Affects Versions: 0.9.0
    Reporter: Rohan Mehta
    Priority: Minor
    Fix For: 0.9.0
    Attachments: NUTCH-581-2.patch, UpdateSearch.patch

The DistributedSearch client does not pick up search servers added to the search-servers.txt file on the fly. With this patch it updates the search servers on the fly, so the client does not need a restart.
[jira] [Closed] (NUTCH-877) Allow setting of slop values for non-quote phrase queries on query-basic plugin
[ https://issues.apache.org/jira/browse/NUTCH-877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma closed NUTCH-877.

Allow setting of slop values for non-quote phrase queries on query-basic plugin
    Key: NUTCH-877
    URL: https://issues.apache.org/jira/browse/NUTCH-877
    Project: Nutch
    Issue Type: Improvement
    Components: searcher
    Affects Versions: 1.2
    Environment: All
    Reporter: Dennis Kubes
    Assignee: Dennis Kubes
    Fix For: 1.2
    Attachments: NUTCH-877-1-20100809.patch

Patch adds a configuration variable for setting slop values on phrase queries. The default slop value, which currently can't be changed through configuration, is Integer.MAX_VALUE. It produces something like this, which doesn't seem right to me. If you are searching for a phrase you usually want it within a certain distance:

    2.9141337E-4 = weight(content:my phrase~2147483647 in 1029), product of:
      * 0.07163286 = queryWeight(content:my phrase~2147483647), product of:
          o 9.657982 = idf(content: my=13470 phrase=534)
          o 0.0074169594 = queryNorm

This patch adds the query.phrase.slop configuration value to the nutch-default.xml file. It has a default setting of 5.
[jira] [Updated] (NUTCH-265) Getting Clustered results in better form.
[ https://issues.apache.org/jira/browse/NUTCH-265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-265: Bulk close of legacy issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira Getting Clustered results in better form. - Key: NUTCH-265 URL: https://issues.apache.org/jira/browse/NUTCH-265 Project: Nutch Issue Type: Improvement Components: searcher Affects Versions: 0.7.2 Reporter: Kris K The cluster results are coming with title and link to URL. For improvement it should be clustered keyword phrases (Like Vivisimo type). Any person can share their views on it. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-674) NutchBean doesn't check for searcher.dir existance.
[ https://issues.apache.org/jira/browse/NUTCH-674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-674:

Bulk close of legacy issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

NutchBean doesn't check for searcher.dir existance.
    Key: NUTCH-674
    URL: https://issues.apache.org/jira/browse/NUTCH-674
    Project: Nutch
    Issue Type: Bug
    Components: searcher
    Affects Versions: 0.9.0
    Environment: Looks like a platform-independent problem.
    Reporter: Kuba Kończyk

If searcher.dir doesn't exist or isn't accessible, the searcher just continues and reports that 0 hits were found. It should throw an exception or log an error instead. As a starting point, a patch to solve this problem was proposed some time ago on nutch-dev: http://www.mail-archive.com/nutch-developers@lists.sourceforge.net/msg09422.html
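The fail-fast behaviour asked for in NUTCH-674 above could look roughly like the sketch below. It is not the patch linked in the report and not NutchBean code; the class and method names are hypothetical, and throwing `IllegalArgumentException` (rather than logging) is just one of the two options the report mentions.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of Nutch.
class SearcherDirCheckSketch {

  /** Returns the directory as a Path, or throws if it is missing or unreadable. */
  static Path requireReadableDir(String dir) {
    Path path = Paths.get(dir);
    if (!Files.isDirectory(path) || !Files.isReadable(path)) {
      throw new IllegalArgumentException(
          "searcher.dir does not exist or is not accessible: " + path.toAbsolutePath());
    }
    return path;
  }
}
```

Calling this once at start-up with the configured searcher.dir value would turn the silent "0 hits" into an immediate, explicit error.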
[jira] [Updated] (NUTCH-423) Add other index-basic fields as query plugins
[ https://issues.apache.org/jira/browse/NUTCH-423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-423:

Bulk close of legacy issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

Add other index-basic fields as query plugins
    Key: NUTCH-423
    URL: https://issues.apache.org/jira/browse/NUTCH-423
    Project: Nutch
    Issue Type: Improvement
    Components: searcher
    Affects Versions: 0.9.0
    Reporter: stack
    Priority: Minor
    Attachments: other-index-basic-query-fields.patch

The basic indexer plugin adds 'host', 'site', 'url', 'content', 'title', and 'anchor'. The query-basic plugin expands queries against the 'default' field to run against all basic indexer plugin fields. The query-url plugin adds query filtering on the 'url' field and query-site on 'site'. This patch adds plugins to filter on the remainder: 'host', 'content', 'title', and 'anchor'.
[jira] [Updated] (NUTCH-47) Configure host filter to do wildcard prefixes - *.redhat.com
[ https://issues.apache.org/jira/browse/NUTCH-47?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-47:

Bulk close of legacy issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

Configure host filter to do wildcard prefixes - *.redhat.com
    Key: NUTCH-47
    URL: https://issues.apache.org/jira/browse/NUTCH-47
    Project: Nutch
    Issue Type: Improvement
    Components: searcher
    Environment: Linux
    Reporter: byron miller
    Priority: Minor

Right now you can configure the max results per host for a query response, but that seems limited to exact host matches such as www.redhat.com. In many ways it would be nice to include the capability to match hosts by wildcard. For example, search for redhat on mozdex.com:

    http://www.mozdex.com/search.jsp?query=redhat

And you will see:

    www.apac.redhat.com
    www.europe.redhat.com
    www.in.redhat.com

Could this be fixed so that *.redhat.com is grouped under "find more sources" for redhat.com, or something like that? I may be able to tweak the other processes, but I can envision a problem of people creating www1, www2, www3 or using other country codes for the same or similar content, filling up pages of SERPs that could hold other relevant information.
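The wildcard host matching asked for in NUTCH-47 above can be sketched as a suffix comparison. This is an illustration only, not Nutch's per-host result limiting; the class and method names are hypothetical, and treating `*.redhat.com` as also matching the bare domain `redhat.com` is a design assumption made here.

```java
import java.util.Locale;

// Hypothetical helper for illustration; not part of Nutch.
class HostWildcardSketch {

  /**
   * Matches a host against an exact name or a "*.domain" pattern, ignoring case.
   * "*.redhat.com" matches any subdomain of redhat.com and, by assumption, redhat.com itself.
   */
  static boolean matches(String pattern, String host) {
    String p = pattern.toLowerCase(Locale.ROOT);
    String h = host.toLowerCase(Locale.ROOT);
    if (p.startsWith("*.")) {
      String dotSuffix = p.substring(1);  // ".redhat.com"
      String bareDomain = p.substring(2); // "redhat.com"
      return h.endsWith(dotSuffix) || h.equals(bareDomain);
    }
    return h.equals(p);
  }
}
```

Comparing against the suffix with its leading dot keeps unrelated hosts such as notredhat.com from matching. Grouping results by the first matching pattern would then collapse www.apac.redhat.com, www.europe.redhat.com and www.in.redhat.com into one bucket.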
[jira] [Updated] (NUTCH-943) Search Results default dedup field site should be stored in index.
[ https://issues.apache.org/jira/browse/NUTCH-943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-943:

Bulk close of legacy issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

Search Results default dedup field site should be stored in index.
    Key: NUTCH-943
    URL: https://issues.apache.org/jira/browse/NUTCH-943
    Project: Nutch
    Issue Type: Bug
    Components: indexer, searcher
    Affects Versions: 1.2
    Reporter: Charan Malemarpuram
    Attachments: NUTCH-943.patch

The site field is not configured as a stored field in the SOLR schema. As a result, search always returns only two results, together with a See More Hits button, even if the results are from different sites. The attached patch changes the default schema.xml config to store the site field.
[jira] [Updated] (NUTCH-469) changes to geoPosition plugin to make it work on nutch 0.9
[ https://issues.apache.org/jira/browse/NUTCH-469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-469: Bulk close of legacy issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira changes to geoPosition plugin to make it work on nutch 0.9 -- Key: NUTCH-469 URL: https://issues.apache.org/jira/browse/NUTCH-469 Project: Nutch Issue Type: Improvement Components: indexer, searcher Affects Versions: 0.9.0 Reporter: Mike Schwartz Attachments: NUTCH-469-2007-05-09.txt.gz, geoPosition-0.5.tgz, geoPosition0.6_cdiff.zip I have modified the geoPosition plugin (http://wiki.apache.org/nutch/GeoPosition) code to work with nutch 0.9. (The code was built originally using nutch 0.7.) I'd like to contribute my changes back to the nutch project. I already communicated with the code's author (Matthias Jaekle), and he agrees with my mods. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-377) Add possibility to search for multiple values
[ https://issues.apache.org/jira/browse/NUTCH-377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-377:

Bulk close of legacy issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

Add possibility to search for multiple values
    Key: NUTCH-377
    URL: https://issues.apache.org/jira/browse/NUTCH-377
    Project: Nutch
    Issue Type: Improvement
    Components: searcher
    Reporter: Stefan Neufeind

Searches with boolean operators (AND or OR) are not (yet) possible. All search items are always searched with AND. But it would be nice to be able to allow multiple values for a certain field. Maybe that could be done using a separator? As an example you might want to search for:

    someword site:www.example.org|www.apache.org

Which (to my understanding) would allow searching for one or more words with a restriction to those two sites. It would avoid having to implement AND and OR fully (maybe even including brackets) but would cover a few often-used cases, imho. Easy or hard to do? To my understanding Lucene itself allows AND/OR searches, so this might basically be a problem of string parsing and query building towards Lucene?
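The separator idea from NUTCH-377 above is mostly string parsing, as the reporter suspects. Below is a minimal sketch, not Nutch's query parser: the class name and the `|` handling are hypothetical, and the output string only mimics the `+term` and `+(a b)` notation of Lucene's classic query syntax for readability. It does not build real Lucene query objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical parser for illustration; not part of Nutch.
class MultiValueQuerySketch {
  final List<String> terms = new ArrayList<>();
  final Map<String, List<String>> fieldAlternatives = new LinkedHashMap<>();

  /**
   * Splits on whitespace. A token of the form field:v1|v2 becomes a set of
   * alternatives for that field; anything else is a plain required term.
   * Note: any token containing ':' (a raw URL, for example) is read as field:value.
   */
  static MultiValueQuerySketch parse(String query) {
    MultiValueQuerySketch q = new MultiValueQuerySketch();
    for (String token : query.trim().split("\\s+")) {
      if (token.isEmpty()) {
        continue;
      }
      int colon = token.indexOf(':');
      if (colon > 0 && colon < token.length() - 1) {
        String field = token.substring(0, colon);
        List<String> values = Arrays.asList(token.substring(colon + 1).split("\\|"));
        q.fieldAlternatives.computeIfAbsent(field, k -> new ArrayList<>()).addAll(values);
      } else {
        q.terms.add(token);
      }
    }
    return q;
  }

  /** Renders required terms as +term and each field's alternatives as +(f:a f:b). */
  String toBooleanString() {
    List<String> clauses = new ArrayList<>();
    for (String term : terms) {
      clauses.add("+" + term);
    }
    for (Map.Entry<String, List<String>> e : fieldAlternatives.entrySet()) {
      List<String> alternatives = new ArrayList<>();
      for (String value : e.getValue()) {
        alternatives.add(e.getKey() + ":" + value);
      }
      clauses.add("+(" + String.join(" ", alternatives) + ")");
    }
    return String.join(" ", clauses);
  }
}
```

For the query in the report, this reads as "someword is required, and the site must be one of the two listed hosts", which covers the restricted-OR case without general boolean parsing.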
[jira] [Updated] (NUTCH-453) Move stop words to a config file
[ https://issues.apache.org/jira/browse/NUTCH-453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-453: Bulk close of legacy issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira Move stop words to a config file Key: NUTCH-453 URL: https://issues.apache.org/jira/browse/NUTCH-453 Project: Nutch Issue Type: Improvement Components: indexer, searcher Reporter: Steve Severance Priority: Minor Move the stop words from the code to a config file. This will allow the stop words to be modified without recompiling the code. The format could be the same as the regex-urlfilter where regexs are used to define the words or a plain text file of words could be used. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
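The plain-text variant proposed in NUTCH-453 above could be loaded along these lines. This is a sketch, not Nutch code: the class name is hypothetical, and the file conventions (one word per line, blank lines ignored, `#` starting a comment, case folded to lower case) are assumptions chosen for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical loader for illustration; not part of Nutch.
class StopWordsSketch {

  /** Reads one stop word per line, skipping blank lines and lines starting with '#'. */
  static Set<String> load(Reader in) {
    Set<String> words = new LinkedHashSet<>();
    try (BufferedReader reader = new BufferedReader(in)) {
      String line;
      while ((line = reader.readLine()) != null) {
        line = line.trim();
        if (line.isEmpty() || line.startsWith("#")) {
          continue;
        }
        words.add(line.toLowerCase(Locale.ROOT));
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return words;
  }
}
```

Passing a `FileReader` for a file in the configuration directory would let the list change without recompiling, which is the goal stated in the issue.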
[jira] [Updated] (NUTCH-542) Null Pointer Exception on getSummary when segment no longer exists
[ https://issues.apache.org/jira/browse/NUTCH-542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-542: Bulk close of legacy issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira Null Pointer Exception on getSummary when segment no longer exists -- Key: NUTCH-542 URL: https://issues.apache.org/jira/browse/NUTCH-542 Project: Nutch Issue Type: Bug Components: searcher Affects Versions: 0.9.0 Environment: ubuntu, tomcat5.5 Reporter: Jeff V. Priority: Minor If the index refers to a search result in a given segment, but that segment directory does not exist (has been deleted for some reason) the search.jsp will return a completely blank page because a Null Pointer Exception is being thrown from getSummary. At the very least it would be nice to get a more friendly log message such as segment doesn't exist. But ideally the search should continue with just omitting the non-existent results. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-466) Flexible segment format
[ https://issues.apache.org/jira/browse/NUTCH-466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-466:

Bulk close of legacy issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

Flexible segment format
    Key: NUTCH-466
    URL: https://issues.apache.org/jira/browse/NUTCH-466
    Project: Nutch
    Issue Type: Improvement
    Components: searcher
    Affects Versions: 1.0.0
    Reporter: Andrzej Bialecki
    Assignee: Andrzej Bialecki
    Attachments: ParseFilters.java, segmentparts.patch

In many situations it is necessary to store more data associated with pages than it's possible now with the current segment format. Quite often it's a binary data. There are two common workarounds for this: one is to use per-page metadata, either in Content or ParseData, the other is to use an external independent database using page ID-s as foreign keys.

Currently segments can consist of the following predefined parts: content, crawl_fetch, crawl_generate, crawl_parse, parse_text and parse_data.

I propose a third option, which is a natural extension of this existing segment format, i.e. to introduce the ability to add arbitrarily named segment parts, with the only requirement that they should be MapFile-s that store Writable keys and values. Alternatively, we could define a SegmentPart.Writer/Reader to accommodate even more sophisticated scenarios.

Existing segment API and searcher API (NutchBean, DistributedSearch Client/Server) should be extended to handle such arbitrary parts.

Example applications:
* storing HTML previews of non-HTML pages, such as PDF, PS and Office documents
* storing pre-tokenized version of plain text for faster snippet generation
* storing linguistically tagged text for sophisticated data mining
* storing image thumbnails
etc, etc ...

I'm going to prepare a patchset shortly. Any comments and suggestions are welcome.
[jira] [Updated] (NUTCH-480) Searching multiple indexes with a single nutch instance
[ https://issues.apache.org/jira/browse/NUTCH-480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-480:

Bulk close of legacy issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

Searching multiple indexes with a single nutch instance
    Key: NUTCH-480
    URL: https://issues.apache.org/jira/browse/NUTCH-480
    Project: Nutch
    Issue Type: Improvement
    Components: searcher, web gui
    Affects Versions: 0.8
    Environment: Linux and Windows
    Reporter: Ravi Chintakunta
    Attachments: nutch.zip

Searching across multiple indexes with a single instance of Nutch is a cool feature improvement. I had this requirement for my production site, where we wanted to list the available categories (indexes) to search as check boxes, and the user could select any combination of indexes to search. The results page also displays the number of hits in each index.

To do this:
- I modified web.xml to include the paths to the various search indexes
- Modified Nutch.java to read all the indexes and create IndexReaders
- Modified IndexSearcher.java to handle multiple IndexReaders

In the attached file you will find the patch to the Nutch 0.8 code base and also the newly added files:
- SearchServlet - a servlet that is the web interface for search. This is a simplified version of the jsp versions (without the i18n) and outputs the results in text, xml or json format.
- SearchConstants - an interface for messages and constants

Please note that the patch includes the functionality for spell check, aka "Did you mean?"
[jira] [Updated] (NUTCH-470) Adding optional terms to a query
[ https://issues.apache.org/jira/browse/NUTCH-470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-470: Bulk close of legacy issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira Adding optional terms to a query Key: NUTCH-470 URL: https://issues.apache.org/jira/browse/NUTCH-470 Project: Nutch Issue Type: Wish Components: searcher Affects Versions: 0.9.0 Environment: Any Reporter: Trond Andersen Priority: Minor Attachments: optional.patch I'm missing an API to add optional terms to the Query class. I made a small adjustment to the API to support this. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-541) Index url field untokenized
[ https://issues.apache.org/jira/browse/NUTCH-541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-541: Bulk close of legacy issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira Index url field untokenized --- Key: NUTCH-541 URL: https://issues.apache.org/jira/browse/NUTCH-541 Project: Nutch Issue Type: New Feature Components: indexer, searcher Affects Versions: 1.0.0 Reporter: Enis Soztutar Assignee: Enis Soztutar
The url field is indexed as Store.YES, Index.TOKENIZED. We also need an untokenized version of the url field in some contexts:
1. For deleting duplicates by url (at search time); see NUTCH-455.
2. For restricting the search to a certain url (may be used in the case of RSS search, where each entry in the RSS is added as a distinct document with (possibly) the same url). query-url extends FieldQueryFilter, so:
   Query: url:http://www.apache.org/
   Parsed: url:http http-www http-www-apache www www-apache apache org
   Translated: +url:http-http-www http-www-http-www-apache http-www-apache-www www-www-apache www-apache apache org
3. For accessing document(s) in the search servers (using a query plugin).
I suggest we add url as in index-basic and implement a query-url-untoken plugin.
  doc.add(new Field("url", url.toString(), Field.Store.YES, Field.Index.TOKENIZED));
  doc.add(new Field("url_untoken", url.toString(), Field.Store.NO, Field.Index.UN_TOKENIZED));
-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-72) Query basic filter with correction feature
[ https://issues.apache.org/jira/browse/NUTCH-72?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-72: --- Bulk close of legacy issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira Query basic filter with correction feature -- Key: NUTCH-72 URL: https://issues.apache.org/jira/browse/NUTCH-72 Project: Nutch Issue Type: New Feature Components: searcher Environment: lucene Reporter: Christophe Noel Attachments: querycorrectionplugin.zip
This plugin improves the query-basic plugin with a correction feature. Lucene includes a FuzzyQuery feature, which searches not only for matching terms but for very similar terms too. This plugin should be used instead of query-basic by people looking for an easy way to correct users' query requests. The Correction Query Plugin can be used as follows:
Solution 1: If you want to search for very similar terms, add autocorrectionmod as the first term of the query (example: 'nutch engine' -> 'autocorrectionmod nutch engine').
Solution 2: Create a new search.jsp page which includes a correction checkbox (<input type="checkbox" name="autocorrection" value="true"> may automatically add 'autocorrectionmod' as the first term of the query).
FuzzyQuery has a big problem: it is very slow for large indexes! So the Correction Query Plugin works as follows:
- it is not useful for big indexes
- it only works for words of 5 characters or more
- it only looks for words matching the first 2 characters (to improve performance this should be set to 3 or 4)
- it only works for 65% matching suffixes (the algorithm is Levenshtein)
Please give your opinion about it.
-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
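The candidate restrictions listed in the NUTCH-72 description can be sketched in plain Java. This is only an illustration and not the code from querycorrectionplugin.zip: the class name, the method names, and the choice to measure similarity on the part of the word after the shared prefix as 1 - distance / longest suffix length are all assumptions.

```java
// Hypothetical illustration of the three restrictions described in NUTCH-72.
class FuzzyCandidateFilter {
    static final int MIN_WORD_LENGTH = 5;      // "5 characters and more"
    static final int PREFIX_LENGTH = 2;        // "the 2 first characters"
    static final double MIN_SIMILARITY = 0.65; // "65 % matching suffixes"

    // Classic Levenshtein edit distance using two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // True if indexedWord passes the length, prefix and similarity checks for queryWord.
    static boolean isCandidate(String queryWord, String indexedWord) {
        if (queryWord.length() < MIN_WORD_LENGTH) {
            return false; // short words are never corrected
        }
        if (!indexedWord.startsWith(queryWord.substring(0, PREFIX_LENGTH))) {
            return false; // the prefix must match exactly
        }
        // Assumption: similarity is measured on the suffix after the shared prefix.
        String querySuffix = queryWord.substring(PREFIX_LENGTH);
        String indexedSuffix = indexedWord.substring(PREFIX_LENGTH);
        int longest = Math.max(querySuffix.length(), indexedSuffix.length());
        double similarity = 1.0 - levenshtein(querySuffix, indexedSuffix) / (double) longest;
        return similarity >= MIN_SIMILARITY;
    }

    public static void main(String[] args) {
        System.out.println(isCandidate("nutch", "nutsh"));
        System.out.println(isCandidate("nut", "nuts"));
    }
}
```

Raising PREFIX_LENGTH to 3 or 4, as the description suggests, shrinks the set of indexed words that reach the edit-distance computation, which is the expensive step.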
[jira] [Updated] (NUTCH-260) Three new plugins that parse, index and query meta tags defined in the configuration
[ https://issues.apache.org/jira/browse/NUTCH-260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-260: Bulk close of legacy issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira Three new plugins that parse, index and query meta tags defined in the configuration Key: NUTCH-260 URL: https://issues.apache.org/jira/browse/NUTCH-260 Project: Nutch Issue Type: New Feature Components: indexer, searcher Affects Versions: 0.7.2 Environment: Built and tested on Linux so far. Reporter: Jake Vanderdray Priority: Minor Attachments: nutch_customizations.tar
These plugins allow you to define meta tags in your nutch-site file that you want to include in parsing, indexing and searching. The query plugin must replace query-basic. The format for adding query terms to nutch-site.xml is:
<property>
  <name>meta.names</name>
  <value>keywords,recommended</value>
  <description>This is a comma separated list of meta tag names that will be parsed, indexed and searched against when parse-meta, index-meta and query-meta are used.</description>
</property>
<property>
  <name>meta.boosts</name>
  <value>1.0,5.0</value>
  <description>Comma separated list of boost values when searching using query-meta. The order of the values should match the order of meta.names.</description>
</property>
Meta tags found are assumed to have either a single value or be a comma separated list of values. The values found are added to the index as lucene keywords (i.e. <meta name="keywords" values="First Thing, Second Thing"> would result in two keyword fields named keywords. The first would contain First Thing and the second would contain Second Thing). I had to replace the query-basic plugin in order to allow matches in the meta fields to return hits even if there were no matches in any of the default fields. The query-basic plugin only returns hits when every search term is found in at least one default field.
I needed hits returned if matches were found in at least one field for every term, and/or the entire search phrase appeared in a meta index field. One known bug is that common terms are not getting stripped out of the fields' values before they get indexed, so "The Next Big Thing" could not be matched because the query engine will strip out "the" from all queries. I intend to fix this by stripping out common terms from meta fields before indexing them. Another issue is that searching for "Next Big Thing" would not match meta index values for "Next", "Big" or "Thing". You can consider that a bug or a feature depending on how you look at it. These plugins were written for and only work on the 0.7.2 branch. I'm going to attach a tarball of the source of these three plugins after I create the issue. To use the plugins, you'll need to untar them in your src/plugins directory and add them to the ant build.xml directive (and of course add them in your nutch-site.xml file). If these end up getting added to the project, I'll write up documentation on the wiki. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-445) Domain İndexing / Query Filter
[ https://issues.apache.org/jira/browse/NUTCH-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-445: Bulk close of legacy issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira Domain İndexing / Query Filter -- Key: NUTCH-445 URL: https://issues.apache.org/jira/browse/NUTCH-445 Project: Nutch Issue Type: New Feature Components: indexer, searcher Affects Versions: 0.9.0 Reporter: Enis Soztutar Attachments: TranslatingRawFieldQueryFilter_v1.0.patch, index_query_domain_v1.0.patch, index_query_domain_v1.1.patch, index_query_domain_v1.2.patch
Hostnames contain information about the domain of the host and all of its subdomains. Indexing and searching the domains is important for intuitive behavior. From the DomainIndexingFilter javadoc:
Adds the domain (hostname) and all super domains to the index. For http://lucene.apache.org/nutch/ the following will be added to the index:
* lucene.apache.org
* apache
* org
All hostnames are domain names, but not all domain names are hostnames. In the above example, hostname lucene is a subdomain of apache.org, which is itself a subdomain of org.
Currently the basic indexing filter indexes the hostname in the site field, and the query-site plugin allows searching in the site field. However, site:apache.org will not return http://lucene.apache.org. By indexing the domain, we are able to search domains. Unlike searching the site field (indexed by BasicIndexingFilter), searching the domain field allows us to retrieve lucene.apache.org for the query apache.org.
-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
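The idea of indexing a hostname together with its super domains can be sketched as follows. This is not the attached DomainIndexingFilter; the class and method names are hypothetical, and the sketch assumes that each super domain is stored as a full suffix (apache.org rather than apache), since that is the form a query such as apache.org would need to match.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists a hostname and every super domain above it.
class DomainSuffixes {
    static List<String> superDomains(String host) {
        List<String> domains = new ArrayList<>();
        String rest = host;
        while (!rest.isEmpty()) {
            domains.add(rest);
            int dot = rest.indexOf('.');
            if (dot < 0) {
                break; // reached the top-level label
            }
            rest = rest.substring(dot + 1); // drop the leftmost label
        }
        return domains;
    }

    public static void main(String[] args) {
        System.out.println(superDomains("lucene.apache.org"));
    }
}
```

Each returned value would be added to a separate, untokenized domain field, so an exact-match query on apache.org also finds documents from lucene.apache.org.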
[jira] [Updated] (NUTCH-820) Infinite loop when hitspersite is set
[ https://issues.apache.org/jira/browse/NUTCH-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-820: Bulk close of legacy issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira Infinite loop when hitspersite is set - Key: NUTCH-820 URL: https://issues.apache.org/jira/browse/NUTCH-820 Project: Nutch Issue Type: Bug Components: searcher Affects Versions: 1.0.0 Reporter: Xiao Yang NutchBean will re-search over and over when the page number becomes large and the number of excluded sites exceeds MAX_PROHIBITED_TERMS. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-92) DistributedSearch incorrectly scores results
[ https://issues.apache.org/jira/browse/NUTCH-92?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-92: --- Bulk close of legacy issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira DistributedSearch incorrectly scores results Key: NUTCH-92 URL: https://issues.apache.org/jira/browse/NUTCH-92 Project: Nutch Issue Type: Bug Components: searcher Affects Versions: 1.1 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Attachments: distributed-idf-v2.patch, distributed-idf.patch When running search servers in a distributed setup, using DistributedSearch$Server and Client, total scores are incorrectly calculated. The symptoms are that scores differ depending on how segments are deployed to Servers, i.e. if there is uneven distribution of terms in segment indexes (due to segment size or content differences) then scores will differ depending on how many and which segments are deployed on a particular Server. This may lead to prioritizing of non-relevant results over more relevant ones. The underlying reason for this is that each IndexSearcher (which uses local index on each Server) calculates scores based on the local IDFs of query terms, and not the global IDFs from all indexes together. This means that scores arriving from different Servers to the Client cannot be meaningfully compared, unless all indexes have similar distribution of Terms and similar numbers of documents in them. However, currently the Client mixes all scores together, sorts them by absolute values and picks top hits. These absolute values will change if segments are un-evenly deployed to Servers. Currently the workaround is to deploy the same number of documents in segments per Server, and to ensure that segments contain well-randomized content so that term frequencies for common terms are very similar. 
The solution proposed here (as a result of discussion between ab and cutting, patches are coming) is to calculate global IDFs prior to running the query, and pre-boost query Terms with these global IDFs. This will require one more RPC call per each query (this can be optimized later, e.g. through caching). Then the scores will become normalized according to the global IDFs, and Client will be able to meaningfully compare them. Scores will also become independent of the segment content or local number of documents per Server. This will involve at least the following changes: * change NutchSimilarity.idf(Term, Searcher) to always return 1.0f. This enables us to manipulate scores independently of local IDFs. * add a new method to Searcher interface, int[] getDocFreqs(Term[]), which will return document frequencies for query terms. * modify getSegmentNames() so that it returns also the total number of documents in each segment, or implement this as a separate method (this will be called once during segment init) * in DistributedSearch$Client.search() first make a call to servers to return local IDFs for the current query, and calculate global IDFs for each relevant Term in that query. * multiply the TermQuery boosts by idf(totalDocFreq, totalIndexedDocs), and PhraseQuery boosts by the sum of the idf(totalDocFreqs, totalIndexedDocs) for all of its terms This solution should be applicable with only minor changes to all branches, but initially the patches will be relative to trunk/ . Comments, suggestions and review are welcome! -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
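The global IDF calculation proposed in NUTCH-92 can be sketched as follows. This is not taken from the distributed-idf patches: the class and method names are hypothetical, and the idf formula ln(numDocs / (docFreq + 1)) + 1 is assumed here in the style of Lucene's classic DefaultSimilarity. The per-server arrays stand in for what the proposed getDocFreqs(Term[]) call and the per-segment document counts would return.

```java
// Hypothetical sketch: combine per-server statistics into global IDF boosts.
class GlobalIdf {
    // Assumed formula, in the style of Lucene's classic DefaultSimilarity.
    static float idf(long docFreq, long numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // Sums document frequencies term by term; perServer[s][t] is the
    // document frequency of query term t on server s.
    static int[] sumDocFreqs(int[][] perServer) {
        int[] totals = new int[perServer.length == 0 ? 0 : perServer[0].length];
        for (int[] serverFreqs : perServer) {
            for (int t = 0; t < totals.length; t++) {
                totals[t] += serverFreqs[t];
            }
        }
        return totals;
    }

    // One boost per query term, computed from totals over all servers.
    static float[] globalBoosts(int[][] perServerDocFreqs, int[] perServerNumDocs) {
        long totalDocs = 0;
        for (int n : perServerNumDocs) {
            totalDocs += n;
        }
        int[] totalFreqs = sumDocFreqs(perServerDocFreqs);
        float[] boosts = new float[totalFreqs.length];
        for (int t = 0; t < boosts.length; t++) {
            boosts[t] = idf(totalFreqs[t], totalDocs);
        }
        return boosts;
    }

    public static void main(String[] args) {
        int[][] docFreqs = { { 4, 1 }, { 5, 0 } }; // two servers, two query terms
        int[] numDocs = { 6, 4 };
        for (float boost : globalBoosts(docFreqs, numDocs)) {
            System.out.println(boost);
        }
    }
}
```

Because every server receives the same pre-boosted query and its local idf is fixed at 1.0f, the resulting scores no longer depend on how documents are distributed across servers.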
[jira] [Updated] (NUTCH-573) Multiple Domains - Query Search
[ https://issues.apache.org/jira/browse/NUTCH-573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-573: Bulk close of legacy issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira Multiple Domains - Query Search --- Key: NUTCH-573 URL: https://issues.apache.org/jira/browse/NUTCH-573 Project: Nutch Issue Type: Improvement Components: searcher Affects Versions: 0.9.0 Environment: All Reporter: Rajasekar Karthik Assignee: Enis Soztutar Attachments: multiTermQuery_v1.patch Searching multiple domains can be done in Lucene, but not that efficiently in Nutch. The query +content:abc +(site:www.aaa.com site:www.bbb.com) works in Lucene, but the same concept does not work in Nutch. In Lucene, it works with org.apache.lucene.analysis.KeywordAnalyzer and org.apache.lucene.analysis.standard.StandardAnalyzer but NOT with org.apache.lucene.analysis.SimpleAnalyzer. Is the Nutch analyzer based on SimpleAnalyzer? In this case, is there a workaround to make this work? Is there an option to change which analyzer Nutch is using? Just FYI, another solution (inefficient, I believe) which seems to work in Nutch is the query -site:ccc.com -site:ddd.com -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-764) Add support for vfsfile:// loading of plugins for JBoss
[ https://issues.apache.org/jira/browse/NUTCH-764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-764: Bulk close of legacy issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira Add support for vfsfile:// loading of plugins for JBoss --- Key: NUTCH-764 URL: https://issues.apache.org/jira/browse/NUTCH-764 Project: Nutch Issue Type: Improvement Components: searcher Affects Versions: 1.0.0 Environment: JBoss AS 5.1.0 Reporter: tcur...@approachingpi.com Priority: Trivial
In the file /src/java/org/apache/nutch/plugin/PluginManifestParser.java there is a check to make sure that the plugin file location is a URL formatted like file://path/plugins. When deployed on JBoss, the file protocol will sometimes be vfsfile://path/plugins. The code can operate the same way with vfsfile, so I propose changing the check to also allow this protocol. This would allow Nutch to be deployed on the newer versions of JBoss without any modification. Here is a simple patch:
Index: src/java/org/apache/nutch/plugin/PluginManifestParser.java
===
--- src/java/org/apache/nutch/plugin/PluginManifestParser.java Mon Nov 09 20:20:51 EST 2009
+++ src/java/org/apache/nutch/plugin/PluginManifestParser.java Mon Nov 09 20:20:51 EST 2009
@@ -121,7 +121,8 @@
     } else if (url == null) {
       LOG.warn("Plugins: directory not found: " + name);
       return null;
-    } else if (!"file".equals(url.getProtocol())) {
+    } else if (!"file".equals(url.getProtocol())
+        && !"vfsfile".equals(url.getProtocol())) {
       LOG.warn("Plugins: not a file: url. Can't load plugins from: " + url);
       return null;
     }
-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
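The relaxed protocol check from NUTCH-764 can be tried in isolation with a small sketch. It is not the PluginManifestParser code: the class and method names are hypothetical, and java.net.URI is used instead of java.net.URL on the assumption that a plain JVM has no stream handler registered for vfsfile, so constructing a URL for that scheme outside JBoss would fail.

```java
import java.net.URI;
import java.util.Set;

// Hypothetical stand-alone version of the protocol check discussed in NUTCH-764.
class PluginProtocolCheck {
    // Schemes treated as local plugin directories; "vfsfile" is the JBoss case from the report.
    private static final Set<String> LOCAL_SCHEMES = Set.of("file", "vfsfile");

    static boolean isLocalPluginLocation(URI location) {
        String scheme = location.getScheme();
        return scheme != null && LOCAL_SCHEMES.contains(scheme);
    }

    public static void main(String[] args) {
        System.out.println(isLocalPluginLocation(URI.create("file:/opt/nutch/plugins")));
        System.out.println(isLocalPluginLocation(URI.create("vfsfile:/opt/nutch/plugins")));
        System.out.println(isLocalPluginLocation(URI.create("http://example.org/plugins")));
    }
}
```

Keeping the accepted schemes in one set means a further container-specific scheme could be allowed without adding another clause to the condition.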
[jira] [Updated] (NUTCH-455) dedup on tokenized fields is faulty
[ https://issues.apache.org/jira/browse/NUTCH-455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-455: Bulk close of legacy issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira dedup on tokenized fields is faulty --- Key: NUTCH-455 URL: https://issues.apache.org/jira/browse/NUTCH-455 Project: Nutch Issue Type: Bug Components: searcher Affects Versions: 0.9.0 Reporter: Enis Soztutar Attachments: IndexSearcherCacheWarm.patch (From LUCENE-252) Nutch uses several index servers, and the search results from these servers are merged using a dedup field for deleting duplicates. The values from this field are cached by Lucene's FieldCacheImpl. The default is the site field, which is indexed and tokenized. However, for a tokenized field (for example url in Nutch), FieldCacheImpl returns an array of terms rather than an array of field values, so dedup'ing becomes faulty. The current FieldCache implementation does not respect tokenized fields and, as described above, caches only terms. So when we search using url as the dedup field and a Hit is constructed in IndexSearcher, the dedupValue becomes a token of the url (such as www or com) rather than the whole url. This prevents using tokenized fields as the dedup field. I have written a patch for Lucene and attached it in http://issues.apache.org/jira/browse/LUCENE-252; this patch fixes the aforementioned issue with tokenized field caching. However, building such a cache for about 1.5M documents takes 20+ secs. The code in IndexSearcher.translateHits() starts with if (dedupField != null) dedupValues = FieldCache.DEFAULT.getStrings(reader, dedupField); and the cache is built on the first call of search in IndexSearcher. Long story short, I have written a patch against IndexSearcher which warms up the caches of the wanted fields (configurable) in the constructor. 
I think we should vote for LUCENE-252, and then commit the above patch with the last version of lucene. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-708) NutchBean: OOM due to searcher.max.hits and dedup.
[ https://issues.apache.org/jira/browse/NUTCH-708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-708: Bulk close of legacy issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira NutchBean: OOM due to searcher.max.hits and dedup. -- Key: NUTCH-708 URL: https://issues.apache.org/jira/browse/NUTCH-708 Project: Nutch Issue Type: Bug Components: searcher Affects Versions: 1.0.0 Environment: Ubuntu Linux, Java 5. Reporter: Aaron Binns When searching an index we built for the National Archives, this one in particular: http://webharvest.gov/collections/congress110th/ We ran into an interesting situation. We were using searcher.max.hits=1000 in order to get faster searches. Since our index is sorted, the best documents are at the front and setting searcher.max.hits=1000 would give us a nice trade-off of search quality vs. response time. What I discovered was that with dedup (on site) enabled, we would get into this loop where the searcher.max.hits would limit the raw hits to 1000 and the deduplication code would get to the end of those 1000 results and still need more as it hadn't found enough de-dup'd results to satisfy the query. The first 6 pages of results would be fine, but when we got to page 7, the NutchBean would need more than 1000 raw results in order to get 60 de-duped results. 
The code:
  for (int rawHitNum = 0; rawHitNum < hits.getTotal(); rawHitNum++) {
    // get the next raw hit
    if (rawHitNum >= hits.getLength()) {
      // optimize query by prohibiting more matches on some excluded values
      Query optQuery = (Query)query.clone();
      for (int i = 0; i < excludedValues.size(); i++) {
        if (i == MAX_PROHIBITED_TERMS)
          break;
        optQuery.addProhibitedTerm(((String)excludedValues.get(i)), dedupField);
      }
      numHitsRaw = (int)(numHitsRaw * rawHitsFactor);
      if (LOG.isInfoEnabled()) {
        LOG.info("re-searching for " + numHitsRaw + " raw hits, query: " + optQuery);
      }
      hits = searcher.search(optQuery, numHitsRaw, dedupField, sortField, reverse);
      if (LOG.isInfoEnabled()) {
        LOG.info("found " + hits.getTotal() + " raw hits");
      }
      rawHitNum = -1;
      continue;
    }
The loop constraints were never satisfied, as rawHitNum and hits.getLength() are capped by searcher.max.hits (1000). numHitsRaw keeps increasing by a factor of 2 (rawHitsFactor) until it gets to 2^31 or so; deep down in the search library code an array is allocated using that value as the size, and you get an OOM. We worked around the problem by abandoning the use of searcher.max.hits. I suppose we could have increased the value, but the index was small enough (~10GB) that disabling searcher.max.hits didn't degrade the response time too much. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
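One way to avoid the runaway growth described in NUTCH-708 is to stop re-searching once the searcher has already returned as many raw hits as its cap allows. The sketch below is not a patch against NutchBean: the class name, the method name, the -1 return value and the convention that maxHits <= 0 means "no cap" are all assumptions made for illustration.

```java
// Hypothetical guard for the re-search step quoted in NUTCH-708.
class RawHitsGuard {
    // Returns the next raw-hit request size, or -1 when asking for more cannot help.
    // returnedLength is the number of raw hits the last search actually returned;
    // maxHits is the searcher's cap (values <= 0 are treated as "no cap" here).
    static int nextRequestSize(int numHitsRaw, float rawHitsFactor, int returnedLength, int maxHits) {
        if (maxHits > 0 && returnedLength >= maxHits) {
            return -1; // the cap was reached, so a larger request returns the same hits
        }
        long next = (long) (numHitsRaw * (double) rawHitsFactor);
        if (next <= numHitsRaw) {
            return -1; // the factor does not grow the request
        }
        return (int) Math.min(next, Integer.MAX_VALUE); // never overflow int
    }

    public static void main(String[] args) {
        System.out.println(nextRequestSize(120, 2.0f, 120, 0));
        System.out.println(nextRequestSize(1000, 2.0f, 1000, 1000));
    }
}
```

A caller would treat -1 as "return the de-duplicated hits found so far" instead of resetting rawHitNum and searching again.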
[jira] [Closed] (NUTCH-72) Query basic filter with correction feature
[ https://issues.apache.org/jira/browse/NUTCH-72?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-72. -- Resolution: Won't Fix Query basic filter with correction feature -- Key: NUTCH-72 URL: https://issues.apache.org/jira/browse/NUTCH-72 Project: Nutch Issue Type: New Feature Components: searcher Environment: lucene Reporter: Christophe Noel Attachments: querycorrectionplugin.zip
This plugin improves the query-basic plugin with a correction feature. Lucene includes a FuzzyQuery feature, which searches not only for matching terms but for very similar terms too. This plugin should be used instead of query-basic by people looking for an easy way to correct users' query requests. The Correction Query Plugin can be used as follows:
Solution 1: If you want to search for very similar terms, add autocorrectionmod as the first term of the query (example: 'nutch engine' -> 'autocorrectionmod nutch engine').
Solution 2: Create a new search.jsp page which includes a correction checkbox (<input type="checkbox" name="autocorrection" value="true"> may automatically add 'autocorrectionmod' as the first term of the query).
FuzzyQuery has a big problem: it is very slow for large indexes! So the Correction Query Plugin works as follows:
- it is not useful for big indexes
- it only works for words of 5 characters or more
- it only looks for words matching the first 2 characters (to improve performance this should be set to 3 or 4)
- it only works for 65% matching suffixes (the algorithm is Levenshtein)
Please give your opinion about it.
-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Closed] (NUTCH-294) Topic-maps of related searchwords
[ https://issues.apache.org/jira/browse/NUTCH-294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-294. --- Resolution: Won't Fix Topic-maps of related searchwords - Key: NUTCH-294 URL: https://issues.apache.org/jira/browse/NUTCH-294 Project: Nutch Issue Type: New Feature Components: searcher Reporter: Stefan Neufeind Would it be possible to offer a user topic-maps? It's when you search for something and get topic-related words that might also be of interest for you. I wonder if that's somehow possible with the ngram-index for did you mean (see separate feature-enhancement-bug for this), but we'd need to have a relation between words (in what context do they occur). For the webfrontend usually trees are used - which for some users offer quite impressive eye-candy :-) E.g. see this advertisement by Novell where I've just seen a similar topic-map as well: http://www.novell.com/de-de/company/advertising/defineyouropen.html -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Closed] (NUTCH-943) Search Results default dedup field site should be stored in index.
[ https://issues.apache.org/jira/browse/NUTCH-943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-943. --- Resolution: Won't Fix Search Results default dedup field site should be stored in index. Key: NUTCH-943 URL: https://issues.apache.org/jira/browse/NUTCH-943 Project: Nutch Issue Type: Bug Components: indexer, searcher Affects Versions: 1.2 Reporter: Charan Malemarpuram Attachments: NUTCH-943.patch The site field is not configured as a stored field in the SOLR schema. Search always returns only two results and shows a "See More Hits" button, even if the results are from different sites. The attached patch changes the default schema.xml config to store the site field. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Closed] (NUTCH-540) some problem about the Nutch cache
[ https://issues.apache.org/jira/browse/NUTCH-540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-540. --- Resolution: Won't Fix some problem about the Nutch cache -- Key: NUTCH-540 URL: https://issues.apache.org/jira/browse/NUTCH-540 Project: Nutch Issue Type: Bug Components: searcher Affects Versions: 0.9.0 Environment: Red hat AS4 + Tomcat5.5 + Nutch0.9 Reporter: crossany Attachments: 1.gif, 1186733525.jpg I am Chinese and have been testing searches for Chinese words in Nutch. I installed Nutch 0.9 in Tomcat 5 on Linux; the Tomcat charset is UTF-8. I used Nutch to crawl a Chinese website whose charset is also UTF-8. When I search for a Chinese word with Nutch on Tomcat, the title and description of the search results display correctly. But when I click the cache link, the cached page is displayed with the wrong character encoding, even though the cached page's charset is also UTF-8. I found another website that uses Nutch, http://www.synoo.com:8080/zh/, and searching for a Chinese word there shows the same error. Luke displays the Chinese words in the segments correctly, so I think this may be a bug. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Closed] (NUTCH-469) changes to geoPosition plugin to make it work on nutch 0.9
[ https://issues.apache.org/jira/browse/NUTCH-469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-469. --- Resolution: Won't Fix changes to geoPosition plugin to make it work on nutch 0.9 -- Key: NUTCH-469 URL: https://issues.apache.org/jira/browse/NUTCH-469 Project: Nutch Issue Type: Improvement Components: indexer, searcher Affects Versions: 0.9.0 Reporter: Mike Schwartz Attachments: NUTCH-469-2007-05-09.txt.gz, geoPosition-0.5.tgz, geoPosition0.6_cdiff.zip I have modified the geoPosition plugin (http://wiki.apache.org/nutch/GeoPosition) code to work with nutch 0.9. (The code was built originally using nutch 0.7.) I'd like to contribute my changes back to the nutch project. I already communicated with the code's author (Matthias Jaekle), and he agrees with my mods. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Closed] (NUTCH-92) DistributedSearch incorrectly scores results
[ https://issues.apache.org/jira/browse/NUTCH-92?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-92. -- Resolution: Won't Fix DistributedSearch incorrectly scores results Key: NUTCH-92 URL: https://issues.apache.org/jira/browse/NUTCH-92 Project: Nutch Issue Type: Bug Components: searcher Affects Versions: 1.1 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Attachments: distributed-idf-v2.patch, distributed-idf.patch When running search servers in a distributed setup, using DistributedSearch$Server and Client, total scores are incorrectly calculated. The symptoms are that scores differ depending on how segments are deployed to Servers, i.e. if there is uneven distribution of terms in segment indexes (due to segment size or content differences) then scores will differ depending on how many and which segments are deployed on a particular Server. This may lead to prioritizing of non-relevant results over more relevant ones. The underlying reason for this is that each IndexSearcher (which uses local index on each Server) calculates scores based on the local IDFs of query terms, and not the global IDFs from all indexes together. This means that scores arriving from different Servers to the Client cannot be meaningfully compared, unless all indexes have similar distribution of Terms and similar numbers of documents in them. However, currently the Client mixes all scores together, sorts them by absolute values and picks top hits. These absolute values will change if segments are un-evenly deployed to Servers. Currently the workaround is to deploy the same number of documents in segments per Server, and to ensure that segments contain well-randomized content so that term frequencies for common terms are very similar. The solution proposed here (as a result of discussion between ab and cutting, patches are coming) is to calculate global IDFs prior to running the query, and pre-boost query Terms with these global IDFs. 
This will require one more RPC call per each query (this can be optimized later, e.g. through caching). Then the scores will become normalized according to the global IDFs, and Client will be able to meaningfully compare them. Scores will also become independent of the segment content or local number of documents per Server. This will involve at least the following changes: * change NutchSimilarity.idf(Term, Searcher) to always return 1.0f. This enables us to manipulate scores independently of local IDFs. * add a new method to Searcher interface, int[] getDocFreqs(Term[]), which will return document frequencies for query terms. * modify getSegmentNames() so that it returns also the total number of documents in each segment, or implement this as a separate method (this will be called once during segment init) * in DistributedSearch$Client.search() first make a call to servers to return local IDFs for the current query, and calculate global IDFs for each relevant Term in that query. * multiply the TermQuery boosts by idf(totalDocFreq, totalIndexedDocs), and PhraseQuery boosts by the sum of the idf(totalDocFreqs, totalIndexedDocs) for all of its terms This solution should be applicable with only minor changes to all branches, but initially the patches will be relative to trunk/ . Comments, suggestions and review are welcome! -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Closed] (NUTCH-674) NutchBean doesn't check for searcher.dir existance.
[ https://issues.apache.org/jira/browse/NUTCH-674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-674. --- Resolution: Won't Fix NutchBean doesn't check for searcher.dir existance. --- Key: NUTCH-674 URL: https://issues.apache.org/jira/browse/NUTCH-674 Project: Nutch Issue Type: Bug Components: searcher Affects Versions: 0.9.0 Environment: Looks like platform independent problem. Reporter: Kuba Kończyk If searcher.dir doesn't exist or is not accessible, the searcher will just continue and report that 0 hits were found. It should throw an exception or log an error instead. As a starting point, a patch to solve this problem was proposed some time ago on nutch-dev: http://www.mail-archive.com/nutch-developers@lists.sourceforge.net/msg09422.html
[jira] [Closed] (NUTCH-820) Infinite loop when hitspersite is set
[ https://issues.apache.org/jira/browse/NUTCH-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-820. --- Resolution: Won't Fix Infinite loop when hitspersite is set - Key: NUTCH-820 URL: https://issues.apache.org/jira/browse/NUTCH-820 Project: Nutch Issue Type: Bug Components: searcher Affects Versions: 1.0.0 Reporter: Xiao Yang NutchBean will re-search over and over when the page number becomes large and the excluded sites exceed MAX_PROHIBITED_TERMS.
[jira] [Closed] (NUTCH-708) NutchBean: OOM due to searcher.max.hits and dedup.
[ https://issues.apache.org/jira/browse/NUTCH-708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-708. --- Resolution: Won't Fix NutchBean: OOM due to searcher.max.hits and dedup. -- Key: NUTCH-708 URL: https://issues.apache.org/jira/browse/NUTCH-708 Project: Nutch Issue Type: Bug Components: searcher Affects Versions: 1.0.0 Environment: Ubuntu Linux, Java 5. Reporter: Aaron Binns When searching an index we built for the National Archives, this one in particular: http://webharvest.gov/collections/congress110th/ We ran into an interesting situation. We were using searcher.max.hits=1000 in order to get faster searches. Since our index is sorted, the best documents are at the front and setting searcher.max.hits=1000 would give us a nice trade-off of search quality vs. response time. What I discovered was that with dedup (on site) enabled, we would get into this loop where the searcher.max.hits would limit the raw hits to 1000 and the deduplication code would get to the end of those 1000 results and still need more as it hadn't found enough de-dup'd results to satisfy the query. The first 6 pages of results would be fine, but when we got to page 7, the NutchBean would need more than 1000 raw results in order to get 60 de-duped results. 
The code:

  for (int rawHitNum = 0; rawHitNum < hits.getTotal(); rawHitNum++) {
    // get the next raw hit
    if (rawHitNum >= hits.getLength()) {
      // optimize query by prohibiting more matches on some excluded values
      Query optQuery = (Query)query.clone();
      for (int i = 0; i < excludedValues.size(); i++) {
        if (i == MAX_PROHIBITED_TERMS)
          break;
        optQuery.addProhibitedTerm(((String)excludedValues.get(i)), dedupField);
      }
      numHitsRaw = (int)(numHitsRaw * rawHitsFactor);
      if (LOG.isInfoEnabled()) {
        LOG.info("re-searching for " + numHitsRaw + " raw hits, query: " + optQuery);
      }
      hits = searcher.search(optQuery, numHitsRaw, dedupField, sortField, reverse);
      if (LOG.isInfoEnabled()) {
        LOG.info("found " + hits.getTotal() + " raw hits");
      }
      rawHitNum = -1;
      continue;
    }

The loop constraints were never satisfied, as rawHitNum and hits.getLength() are capped by searcher.max.hits (1000). numHitsRaw keeps increasing by a factor of 2 (rawHitsFactor) until it gets to 2^31 or so; deep down in the search library code an array is allocated using that value as the size, and you get an OOM. We worked around the problem by abandoning the use of searcher.max.hits. I suppose we could have increased the value, but the index was small enough (~10GB) that disabling searcher.max.hits didn't degrade the response time too much.
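One way to keep a re-search loop like the one in NUTCH-708 from running away is to saturate the raw-hit request at a ceiling and to stop once a larger request returns no additional hits. The sketch below only simulates that idea; it is not a patch to NutchBean, and the class, the method names and the way the capped searcher is modelled are all assumptions.

```java
/** Hypothetical simulation of a guarded re-search loop; not NutchBean code. */
final class ReSearchGuardSketch {

    /** Grows the request by rawHitsFactor, saturating at cap instead of racing towards 2^31. */
    static int nextNumHitsRaw(int numHitsRaw, float rawHitsFactor, int cap) {
        long grown = (long) (numHitsRaw * (double) rawHitsFactor);
        return (int) Math.min(grown, (long) cap);
    }

    /**
     * Models a searcher whose result length is capped at maxHits and a dedup
     * step that always wants more raw hits. Returns how many searches run
     * before the guard notices that a larger request returned nothing new.
     */
    static int searchesUntilNoProgress(int maxHits, int startRaw, float rawHitsFactor) {
        int numHitsRaw = startRaw;
        int previousLength = -1;
        int searches = 0;
        while (true) {
            int length = Math.min(numHitsRaw, maxHits); // what the capped searcher returns
            searches++;
            if (length <= previousLength) {
                return searches; // no progress: give up instead of doubling again
            }
            previousLength = length;
            numHitsRaw = nextNumHitsRaw(numHitsRaw, rawHitsFactor, Integer.MAX_VALUE);
        }
    }

    public static void main(String[] args) {
        System.out.println(searchesUntilNoProgress(1000, 140, 2.0f));
    }
}
```

With searcher.max.hits modelled as 1000, the simulated loop ends after a handful of searches rather than doubling numHitsRaw until an allocation fails.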
[jira] [Closed] (NUTCH-445) Domain İndexing / Query Filter
[ https://issues.apache.org/jira/browse/NUTCH-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-445. --- Resolution: Won't Fix Domain İndexing / Query Filter -- Key: NUTCH-445 URL: https://issues.apache.org/jira/browse/NUTCH-445 Project: Nutch Issue Type: New Feature Components: indexer, searcher Affects Versions: 0.9.0 Reporter: Enis Soztutar Attachments: TranslatingRawFieldQueryFilter_v1.0.patch, index_query_domain_v1.0.patch, index_query_domain_v1.1.patch, index_query_domain_v1.2.patch Hostnames contain information about the domain of the host and all of its subdomains. Indexing and searching the domains is important for intuitive behavior. From the DomainIndexingFilter javadoc:

  /**
   * Adds the domain(hostname) and all super domains to the index.
   * <br> For http://lucene.apache.org/nutch/ the
   * following will be added to the index : <br>
   * <ul>
   * <li>lucene.apache.org </li>
   * <li>apache</li>
   * <li>org </li>
   * </ul>
   * All hostnames are domain names, but not all the domain names are
   * hostnames. In the above example hostname lucene is a
   * subdomain of apache.org, which is itself a subdomain of
   * org <br>
   */

Currently the basic indexing filter indexes the hostname in the site field, and the query-site plugin allows searching in the site field. However, site:apache.org will not return http://lucene.apache.org. By indexing the domain, we are able to search domains. Unlike a search on the site field (indexed by BasicIndexingFilter), searching the domain field allows us to retrieve lucene.apache.org for the query apache.org.
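The expansion described in NUTCH-445 can be sketched as follows. This is not the attached DomainIndexingFilter: the class and method names are hypothetical, and it assumes that each super domain is indexed as a full dotted suffix (apache.org rather than the bare label apache that the quoted javadoc lists), since the description expects the query apache.org to match lucene.apache.org.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch of hostname-to-domain expansion; not the NUTCH-445 patch. */
final class DomainSuffixSketch {

    /** Returns the hostname followed by each of its super domains, most specific first. */
    static List<String> domains(String url) {
        List<String> out = new ArrayList<>();
        String host = URI.create(url).getHost();
        if (host == null) {
            return out; // opaque or host-less URI: nothing to index
        }
        host = host.toLowerCase(Locale.ROOT);
        int from = 0;
        while (from < host.length()) {
            out.add(host.substring(from));     // current suffix
            int dot = host.indexOf('.', from); // next label boundary
            if (dot < 0) {
                break;
            }
            from = dot + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(domains("http://lucene.apache.org/nutch/"));
    }
}
```

Indexing every suffix as its own untokenized value is what would let an exact-match query on a domain field find all hosts beneath that domain.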
[jira] [Closed] (NUTCH-541) Index url field untokenized
[ https://issues.apache.org/jira/browse/NUTCH-541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-541. --- Resolution: Won't Fix Index url field untokenized --- Key: NUTCH-541 URL: https://issues.apache.org/jira/browse/NUTCH-541 Project: Nutch Issue Type: New Feature Components: indexer, searcher Affects Versions: 1.0.0 Reporter: Enis Soztutar Assignee: Enis Soztutar The url field is indexed as Store.YES, Index.TOKENIZED. We also need the untokenized version of the url field in some contexts:
1. For deleting duplicates by url (at search time); see NUTCH-455.
2. For restricting the search to a certain url (may be used in the case of RSS search, where each entry in the RSS feed is added as a distinct document with possibly the same url). query-url extends FieldQueryFilter, so:
   Query: url:http://www.apache.org/
   Parsed: url:http http-www http-www-apache www www-apache apache org
   Translated: +url:http-http-www http-www-http-www-apache http-www-apache-www www-www-apache www-apache apache org
3. For accessing a document (or documents) in the search servers (using the query plugin).
I suggest we add the url as follows in index-basic and implement a query-url-untoken plugin:

  doc.add(new Field("url", url.toString(), Field.Store.YES, Field.Index.TOKENIZED));
  doc.add(new Field("url_untoken", url.toString(), Field.Store.NO, Field.Index.UN_TOKENIZED));
[jira] [Closed] (NUTCH-455) dedup on tokenized fields is faulty
[ https://issues.apache.org/jira/browse/NUTCH-455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-455. --- Resolution: Won't Fix dedup on tokenized fields is faulty --- Key: NUTCH-455 URL: https://issues.apache.org/jira/browse/NUTCH-455 Project: Nutch Issue Type: Bug Components: searcher Affects Versions: 0.9.0 Reporter: Enis Soztutar Attachments: IndexSearcherCacheWarm.patch (From LUCENE-252) Nutch uses several index servers, and the search results from these servers are merged using a dedup field for deleting duplicates. The values of this field are cached by Lucene's FieldCacheImpl. The default is the site field, which is indexed and tokenized. However, for a tokenized field (for example url in Nutch), FieldCacheImpl returns an array of Terms rather than an array of field values, so dedup'ing becomes faulty. The current FieldCache implementation does not respect tokenized fields and, as described above, caches only terms. So when we search using url as the dedup field and a Hit is constructed in IndexSearcher, the dedupValue becomes a token of the url (such as www or com) rather than the whole url. This prevents using tokenized fields as the dedup field. I have written a patch for Lucene and attached it to http://issues.apache.org/jira/browse/LUCENE-252; this patch fixes the aforementioned issue with tokenized field caching. However, building such a cache for about 1.5M documents takes 20+ seconds. The code in IndexSearcher.translateHits() starts with

  if (dedupField != null)
    dedupValues = FieldCache.DEFAULT.getStrings(reader, dedupField);

and the cache is built on the first call to search in IndexSearcher. Long story short, I have written a patch against IndexSearcher which warms up the caches of the wanted (configurable) fields in the constructor. I think we should vote for LUCENE-252, and then commit the above patch with the latest version of Lucene.
[jira] [Closed] (NUTCH-638) Launching Distributed Searchers with URI indicating filesystem to use rather than relying on hadoop config files.
[ https://issues.apache.org/jira/browse/NUTCH-638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-638. --- Resolution: Won't Fix Launching Distributed Searchers with URI indicating filesystem to use rather than relying on hadoop config files. - Key: NUTCH-638 URL: https://issues.apache.org/jira/browse/NUTCH-638 Project: Nutch Issue Type: Improvement Components: searcher Affects Versions: 1.0.0 Reporter: Aaron Nall Priority: Minor Attachments: distributed-search-uri.patch Original Estimate: 0.25h Remaining Estimate: 0.25h I wanted to conduct all index creation operations in hdfs but search from the local file system using the same cluster of machines. I believe that this is a common use case. This required either a parallel nutch install or edits (scripted or manual) to hadoop-site.xml to change the file system from hdfs to local when starting a distributed searcher service. This minor patch makes IndexSearcher and NutchBean honor URIs as supported by hadoop 0.17 without altering existing functionality if simple paths are entered.
[jira] [Closed] (NUTCH-479) Support for OR queries
[ https://issues.apache.org/jira/browse/NUTCH-479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-479. --- Resolution: Won't Fix Support for OR queries -- Key: NUTCH-479 URL: https://issues.apache.org/jira/browse/NUTCH-479 Project: Nutch Issue Type: Improvement Components: searcher Affects Versions: 1.0.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Attachments: nutch_0.9_OR.patch, or.patch, or.patch There have been many requests from users to extend Nutch query syntax to add support for OR queries, in addition to the implicit AND and NOT queries supported now.
[jira] [Closed] (NUTCH-466) Flexible segment format
[ https://issues.apache.org/jira/browse/NUTCH-466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-466. --- Resolution: Won't Fix Flexible segment format --- Key: NUTCH-466 URL: https://issues.apache.org/jira/browse/NUTCH-466 Project: Nutch Issue Type: Improvement Components: searcher Affects Versions: 1.0.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Attachments: ParseFilters.java, segmentparts.patch In many situations it is necessary to store more data associated with pages than is possible with the current segment format. Quite often this is binary data. There are two common workarounds for this: one is to use per-page metadata, either in Content or ParseData; the other is to use an external independent database using page IDs as foreign keys. Currently segments can consist of the following predefined parts: content, crawl_fetch, crawl_generate, crawl_parse, parse_text and parse_data. I propose a third option, which is a natural extension of this existing segment format, i.e. to introduce the ability to add arbitrarily named segment parts, with the only requirement that they should be MapFiles that store Writable keys and values. Alternatively, we could define a SegmentPart.Writer/Reader to accommodate even more sophisticated scenarios. The existing segment API and searcher API (NutchBean, DistributedSearch Client/Server) should be extended to handle such arbitrary parts. Example applications:
* storing HTML previews of non-HTML pages, such as PDF, PS and Office documents
* storing a pre-tokenized version of plain text for faster snippet generation
* storing linguistically tagged text for sophisticated data mining
* storing image thumbnails
etc., etc. I'm going to prepare a patchset shortly. Any comments and suggestions are welcome.
[jira] [Closed] (NUTCH-377) Add possibility to search for multiple values
[ https://issues.apache.org/jira/browse/NUTCH-377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-377. --- Resolution: Won't Fix Add possibility to search for multiple values - Key: NUTCH-377 URL: https://issues.apache.org/jira/browse/NUTCH-377 Project: Nutch Issue Type: Improvement Components: searcher Reporter: Stefan Neufeind Searches with boolean operators (AND or OR) are not (yet) possible. All search items are always combined with AND. But it would be nice to be able to allow multiple values for a certain field. Maybe that could be done using a separator? As an example, you might want to search for: someword site:www.example.org|www.apache.org which (to my understanding) would allow searching for one or more words with a restriction to those two sites. It would avoid having to implement AND and OR fully (maybe even including brackets) but would cover a few often-used cases, imho. Easy or hard to do? To my understanding Lucene itself allows AND/OR searches. So it might basically be a problem of string parsing and query building towards Lucene?
[jira] [Closed] (NUTCH-386) Plugin to index categories by url rules
[ https://issues.apache.org/jira/browse/NUTCH-386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-386. --- Resolution: Won't Fix Plugin to index categories by url rules --- Key: NUTCH-386 URL: https://issues.apache.org/jira/browse/NUTCH-386 Project: Nutch Issue Type: New Feature Components: indexer, searcher Reporter: Ernesto De Santis Priority: Minor Attachments: index-url-category-0.1.zip, index-url-category.jar The compressed zip has an install_notes.txt file with instructions.
[jira] [Closed] (NUTCH-453) Move stop words to a config file
[ https://issues.apache.org/jira/browse/NUTCH-453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-453. --- Resolution: Won't Fix Move stop words to a config file Key: NUTCH-453 URL: https://issues.apache.org/jira/browse/NUTCH-453 Project: Nutch Issue Type: Improvement Components: indexer, searcher Reporter: Steve Severance Priority: Minor Move the stop words from the code to a config file. This will allow the stop words to be modified without recompiling the code. The format could be the same as for regex-urlfilter, where regexes are used to define the words, or a plain text file of words could be used.
[jira] [Closed] (NUTCH-260) Three new plugins that parse, index and query meta tags defined in the configuration
[ https://issues.apache.org/jira/browse/NUTCH-260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-260. --- Resolution: Won't Fix Three new plugins that parse, index and query meta tags defined in the configuration Key: NUTCH-260 URL: https://issues.apache.org/jira/browse/NUTCH-260 Project: Nutch Issue Type: New Feature Components: indexer, searcher Affects Versions: 0.7.2 Environment: Built and tested on Linux so far. Reporter: Jake Vanderdray Priority: Minor Attachments: nutch_customizations.tar These plugins allow you to define meta tags in your nutch-site file that you want to include in parsing, indexing and searching. The query plugin must replace query-basic. The format for adding query terms to nutch-site.xml is:

  <property>
    <name>meta.names</name>
    <value>keywords,recommended</value>
    <description>This is a comma separated list of meta tag names that will be parsed, indexed and searched against when parse-meta, index-meta and query-meta are used.</description>
  </property>
  <property>
    <name>meta.boosts</name>
    <value>1.0,5.0</value>
    <description>Comma separated list of boost values when searching using query-meta. The order of the values should match the order of meta.names.</description>
  </property>

Meta tags found are assumed to have either a single value or a comma separated list of values. The values found are added to the index as Lucene keywords (i.e. <meta name="keywords" values="First Thing, Second Thing"> would result in two keyword fields named keywords; the first would contain First Thing and the second would contain Second Thing). I had to replace the query-basic plugin in order to allow matches in the meta fields to return hits even if there were no matches in any of the default fields. The query-basic plugin only returns hits when every search term is found in at least one default field. I needed hits returned if matches were found in at least one field for every term, and/or the entire search phrase appeared in a meta index field. One known bug is that common terms are not stripped out of the fields' values before they are indexed, so "The Next Big Thing" could not be matched, because the query engine will strip "the" out of all queries. I intend to fix this by stripping common terms from meta fields before indexing them. Another issue is that searching for "Next Big Thing" would not match meta index values for Next, Big or Thing. You can consider that a bug or a feature depending on how you look at it. These plugins were written for and only work on the 0.7.2 branch. I'm going to attach a tarball of the source of these three plugins after I create the issue. To use the plugins, you'll need to untar them in your src/plugins directory and add them to the ant build.xml directive (and of course add them to your nutch-site.xml file). If these end up getting added to the project, I'll write up documentation on the wiki.
[jira] [Closed] (NUTCH-542) Null Pointer Exception on getSummary when segment no longer exists
[ https://issues.apache.org/jira/browse/NUTCH-542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-542. --- Resolution: Won't Fix Null Pointer Exception on getSummary when segment no longer exists -- Key: NUTCH-542 URL: https://issues.apache.org/jira/browse/NUTCH-542 Project: Nutch Issue Type: Bug Components: searcher Affects Versions: 0.9.0 Environment: ubuntu, tomcat5.5 Reporter: Jeff V. Priority: Minor If the index refers to a search result in a given segment, but that segment directory does not exist (has been deleted for some reason), search.jsp will return a completely blank page because a Null Pointer Exception is thrown from getSummary. At the very least it would be nice to get a friendlier log message such as "segment doesn't exist". Ideally, though, the search should continue and simply omit the non-existent results.
[jira] [Closed] (NUTCH-355) The title of query result could like the summary have the highlight??
[ https://issues.apache.org/jira/browse/NUTCH-355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-355. --- Resolution: Won't Fix The title of query result could like the summary have the highlight?? -- Key: NUTCH-355 URL: https://issues.apache.org/jira/browse/NUTCH-355 Project: Nutch Issue Type: Wish Components: searcher Affects Versions: 0.8, 1.0.0 Environment: all Reporter: King Kong Priority: Minor I'd like to highlight the title, but I can't find out how to do it. When I query Nutch, the result should look like this:

  <a href="http://lucene.apache.org/nutch/">Welcome to <b>Nutch</b>!</a>
  This is the first <b>Nutch</b> release as an Apache Lucene sub-project. See CHANGES.txt for details. The release is available here. ... <b>Nutch</b> has now graduated from the Apache incubator, and is now a Subproject of Lucene. ...
[jira] [Closed] (NUTCH-470) Adding optional terms to a query
[ https://issues.apache.org/jira/browse/NUTCH-470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-470. --- Resolution: Won't Fix Adding optional terms to a query Key: NUTCH-470 URL: https://issues.apache.org/jira/browse/NUTCH-470 Project: Nutch Issue Type: Wish Components: searcher Affects Versions: 0.9.0 Environment: Any Reporter: Trond Andersen Priority: Minor Attachments: optional.patch I'm missing an API to add optional terms in the query class. I made a small adjustment to the API to support this.
[jira] [Closed] (NUTCH-764) Add support for vfsfile:// loading of plugins for JBoss
[ https://issues.apache.org/jira/browse/NUTCH-764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-764. --- Resolution: Won't Fix Add support for vfsfile:// loading of plugins for JBoss --- Key: NUTCH-764 URL: https://issues.apache.org/jira/browse/NUTCH-764 Project: Nutch Issue Type: Improvement Components: searcher Affects Versions: 1.0.0 Environment: JBoss AS 5.1.0 Reporter: tcur...@approachingpi.com Priority: Trivial In the file /src/java/org/apache/nutch/plugin/PluginManifestParser.java there is a check to make sure that the plugin file location is a URL formatted like file://path/plugins. When deployed on JBoss, the file protocol will sometimes be vfsfile://path/plugins. The code can operate the same way with vfsfile, so I propose changing the check to also allow this protocol. This would allow Nutch to be deployed on the newer versions of JBoss without any modification. Here is a simple patch:

  Index: src/java/org/apache/nutch/plugin/PluginManifestParser.java
  ===
  --- src/java/org/apache/nutch/plugin/PluginManifestParser.java  Mon Nov 09 20:20:51 EST 2009
  +++ src/java/org/apache/nutch/plugin/PluginManifestParser.java  Mon Nov 09 20:20:51 EST 2009
  @@ -121,7 +121,8 @@
       } else if (url == null) {
         LOG.warn("Plugins: directory not found: " + name);
         return null;
  -    } else if (!"file".equals(url.getProtocol())) {
  +    } else if (!"file".equals(url.getProtocol())
  +        && !"vfsfile".equals(url.getProtocol())) {
         LOG.warn("Plugins: not a file: url. Can't load plugins from: " + url);
         return null;
       }
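The check that NUTCH-764 proposes to relax can also be expressed as a lookup in a small set of accepted protocols, which keeps the condition readable if further protocols are ever added. This is only a sketch: the class, the constant and the method are hypothetical and are not part of PluginManifestParser. It works on the protocol string (as returned by URL.getProtocol()) rather than on a URL object, so it does not depend on a vfsfile stream handler being registered.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of the relaxed protocol check; not PluginManifestParser code. */
final class PluginProtocolSketch {

    /** Protocols assumed to denote a plugin directory on the local file system. */
    private static final Set<String> LOCAL_PROTOCOLS =
            new HashSet<>(Arrays.asList("file", "vfsfile"));

    /** Returns true if plugins may be loaded from a URL with the given protocol. */
    static boolean isLoadable(String protocol) {
        return protocol != null
                && LOCAL_PROTOCOLS.contains(protocol.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        for (String p : new String[] {"file", "vfsfile", "http"}) {
            System.out.println(p + " -> " + isLoadable(p));
        }
    }
}
```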
[jira] [Closed] (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed
[ https://issues.apache.org/jira/browse/NUTCH-290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-290. --- Resolution: Won't Fix parse-pdf: Garbage indexed when text-extraction not allowed --- Key: NUTCH-290 URL: https://issues.apache.org/jira/browse/NUTCH-290 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 0.8 Reporter: Stefan Neufeind Attachments: NUTCH-290-canExtractContent.patch It seems that garbage (or undecoded text?) is indexed when text-extraction for a PDF is not allowed. Example-PDF: http://www.task-switch.nl/Dutch/articles/Management_en_Architectuur_v3.pdf
[jira] [Closed] (NUTCH-358) Language Switching PROBLEM FIXED
[ https://issues.apache.org/jira/browse/NUTCH-358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-358. --- Resolution: Won't Fix Language Switching PROBLEM FIXED Key: NUTCH-358 URL: https://issues.apache.org/jira/browse/NUTCH-358 Project: Nutch Issue Type: Bug Components: web gui Affects Versions: 0.8 Environment: Linux ubuntu 6.0.6 jakarta-tomcat-5.0.28 nutch-0.8 Reporter: David Podunavac Priority: Trivial Language selection at the bottom of the page does not affect the result page. If the browser language is set to e.g. en, the result page (search.jsp) is displayed in the browser's language, EN, no matter which language has been selected via the locale links at the bottom of the page. request.getParameter=lang is useless as far as I can see, so the links at the bottom of the page do not translate the result page's keywords. This must be a bug, which is why I am reporting it.
[jira] [Closed] (NUTCH-389) a url tokenizer implementation for tokenizing index fields : url and host
[ https://issues.apache.org/jira/browse/NUTCH-389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-389. --- Resolution: Won't Fix a url tokenizer implementation for tokenizing index fields : url and host - Key: NUTCH-389 URL: https://issues.apache.org/jira/browse/NUTCH-389 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 0.9.0 Reporter: Enis Soztutar Priority: Minor Attachments: urlTokenizer-improved.diff, urlTokenizer.diff NutchAnalysis.jj tokenizes the input by treating and _ as non-token separators, which is not appropriate in the case of URLs. So I have written a URL tokenizer which emits the tokens that match the regular expression [a-zA-Z0-9]. As stated in http://www.gbiv.com/protocols/uri/rfc/rfc3986.html, which describes the grammar for URIs, URLs can be tokenized with the above expression. The NutchDocumentAnalyzer code is modified to use the UrlTokenizer with the url, site and host fields. See: http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06247.html
[jira] [Closed] (NUTCH-396) mergesegs sorts URLs, making segments useless for subsequent fetch
[ https://issues.apache.org/jira/browse/NUTCH-396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-396. --- Resolution: Won't Fix mergesegs sorts URLs, making segments useless for subsequent fetch -- Key: NUTCH-396 URL: https://issues.apache.org/jira/browse/NUTCH-396 Project: Nutch Issue Type: Bug Components: generator Affects Versions: 0.8 Environment: Mac OS X 10.4.7 Reporter: Doug Cook Priority: Minor Mergesegs leaves the output segment in URL-sorted order. This is a problem if the segment was just generated and not yet fetched - the fetcher likes the URLs to be in essentially random order (sort by URL hash or similar). If I fetch a segment created by mergesegs, my performance is extremely poor since all URLs from a given host will be grouped together and the per-host delays kill me. I have a local fix which I am using: map using a key of MD5(URL) + URL, then, during the reduce phase, chop the MD5 off the front to get the original URL. This is simple, has essentially random order, no problems with collisions, and seems to work nicely. The only thing I don't know is whether or not there is some other tool expecting the sorted order (I would expect not, since generate does not produce this). Right now I have my fix as an option (-randomize), but if there is no other tool requiring sorted order, it's probably cleaner to just make this non-optional. Thoughts?
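The key scheme described in NUTCH-396 (map on MD5(URL) + URL, then chop the MD5 off in the reduce phase) can be sketched with plain JDK classes. This is not the reporter's local fix: the class and method names are hypothetical, and the digest is assumed to be encoded as 32 lowercase hex characters so that it has a fixed width and can be stripped without a delimiter.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical sketch of the MD5-prefixed sort key; not the NUTCH-396 fix itself. */
final class RandomizedKeySketch {

    /** Length of an MD5 digest written as hex (16 bytes, 2 characters each). */
    static final int MD5_HEX_LENGTH = 32;

    /** Map side: prefix the URL with its MD5 so that sorting by key scatters hosts. */
    static String toKey(String url) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder key = new StringBuilder(MD5_HEX_LENGTH + url.length());
            for (byte b : digest) {
                key.append(String.format("%02x", b));
            }
            return key.append(url).toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    /** Reduce side: drop the fixed-width prefix to recover the original URL. */
    static String toUrl(String key) {
        return key.substring(MD5_HEX_LENGTH);
    }

    public static void main(String[] args) {
        String key = toKey("http://lucene.apache.org/nutch/");
        System.out.println(key + " -> " + toUrl(key));
    }
}
```

Keeping the URL inside the key is what makes hash collisions harmless: two URLs with the same digest still produce distinct keys.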
[jira] [Closed] (NUTCH-326) WordExtractor throws java.util.NoSuchElementException on some documents
[ https://issues.apache.org/jira/browse/NUTCH-326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-326. --- Resolution: Won't Fix WordExtractor throws java.util.NoSuchElementException on some documents --- Key: NUTCH-326 URL: https://issues.apache.org/jira/browse/NUTCH-326 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 0.7.1, 0.7.2 Reporter: Tom Jensen Priority: Minor At line 156 in org.apache.nutch.parse.msword.WordExtractor, a java.util.NoSuchElementException is occasionally thrown because there is no check as to whether the Iterator has been exhausted. Suggest adding this: if (!textIt.hasNext()) { break; } just before line 156. Tested with the problem Word documents: exceptions are no longer thrown and text is extracted successfully. Other documents whose text was extracted successfully before continue to work.
[jira] [Closed] (NUTCH-352) Add jar command to bin/nutch to allow launching hadoop job jars
[ https://issues.apache.org/jira/browse/NUTCH-352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-352. --- Resolution: Won't Fix Add jar command to bin/nutch to allow launching hadoop job jars --- Key: NUTCH-352 URL: https://issues.apache.org/jira/browse/NUTCH-352 Project: Nutch Issue Type: New Feature Reporter: David Cathcart Priority: Minor Attachments: nutch-jobjar.diff Add the ability to run hadoop job jars via bin/nutch jar jobjar.jar. See attachment for patch.
[jira] [Closed] (NUTCH-343) Index MP3 SHA1 hashes
[ https://issues.apache.org/jira/browse/NUTCH-343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-343. --- Resolution: Won't Fix Index MP3 SHA1 hashes - Key: NUTCH-343 URL: https://issues.apache.org/jira/browse/NUTCH-343 Project: Nutch Issue Type: New Feature Affects Versions: 0.8, 0.8.1, 0.9.0 Reporter: Hasan Diwan Attachments: parsemp3.pat Add indexing of an MP3's SHA1 hash.
[jira] [Closed] (NUTCH-26) New Http Authentication mechanism
[ https://issues.apache.org/jira/browse/NUTCH-26?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-26. -- Resolution: Won't Fix Bulk close of legacy issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira New Http Authentication mechanism - Key: NUTCH-26 URL: https://issues.apache.org/jira/browse/NUTCH-26 Project: Nutch Issue Type: Improvement Components: fetcher Reporter: Stefan Groschupf Priority: Trivial transferred from: http://sourceforge.net/tracker/index.php?func=detail&aid=990560&group_id=59548&atid=491356 submitted by: Matt Here's a patch and lib (commons-codec, used for Base64 encoding) which implements basic HTTP authentication. I've attempted to build it so we can add more authentication methods at a later time. This also includes the previously discussed MultiProperties class, which allows multiple headers with the same name (as opposed to Properties, which allows only a single one). I believe both John and Doug have had some comments on this. Matt
[jira] [Closed] (NUTCH-259) Problem in IndexSorter after dedup
[ https://issues.apache.org/jira/browse/NUTCH-259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-259. --- Resolution: Won't Fix Bulk close of legacy issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira Problem in IndexSorter after dedup -- Key: NUTCH-259 URL: https://issues.apache.org/jira/browse/NUTCH-259 Project: Nutch Issue Type: Bug Components: indexer Reporter: Michael Priority: Minor When trying to run IndexSorter i'm getting an error: Exception in thread main java.lang.IllegalArgumentException: attempt to access a deleted document at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:282) at org.apache.lucene.index.FilterIndexReader.document(FilterIndexReader.java:104) at org.apache.nutch.indexer.IndexSorter$SortingReader.document(IndexSorter.java:170) at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:186) at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:88) at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:579) at org.apache.nutch.indexer.IndexSorter.sort(IndexSorter.java:240) at org.apache.nutch.indexer.IndexSorter.main(IndexSorter.java:291) -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Closed] (NUTCH-283) If the Fetcher times out and abandons Fetcher Threads, severe errors will occur on those Threads
[ https://issues.apache.org/jira/browse/NUTCH-283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-283. --- Resolution: Won't Fix Bulk close of legacy issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira If the Fetcher times out and abandons Fetcher Threads, severe errors will occur on those Threads Key: NUTCH-283 URL: https://issues.apache.org/jira/browse/NUTCH-283 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 0.8 Reporter: Scott Ganyo Attachments: patch.txt, patch.txt If a Fetcher has chosen to time out and has abandoned outstanding Fetcher Threads, resources that those Fetcher Threads may be using are closed. This naturally causes any abandoned Fetcher Threads to fail when they later attempt to finish up their work in progress. I have a patch that addresses this that I am attaching. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Closed] (NUTCH-158) Process Sitemap data in text, rss or xml format as well as OAI-PMH
[ https://issues.apache.org/jira/browse/NUTCH-158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-158. --- Resolution: Won't Fix Bulk close of legacy issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira Process Sitemap data in text, rss or xml format as well as OAI-PMH -- Key: NUTCH-158 URL: https://issues.apache.org/jira/browse/NUTCH-158 Project: Nutch Issue Type: New Feature Components: fetcher Affects Versions: 0.8 Reporter: byron miller Priority: Minor Add support to the fetcher to look for sitemap files, download them and process them into webdb. Perhaps create a robots.txt directive that can be used to create a standard format for sitemaps in RSS, XML or text format (one line per url) and process that. I would love to see someone stomp on proprietary sitemap features and on making things as Google-specific as they are today :) * RSS format/Atom format (standard) * XML meta description * OAI-PMH meta description (http://www.openarchives.org/OAI/openarchivesprotocol.html) Perhaps even a pre-crawler that will scour for these to inject into the web db to help build your link map, so you could even just index topN.
[jira] [Closed] (NUTCH-251) Administration GUI
[ https://issues.apache.org/jira/browse/NUTCH-251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-251. --- Resolution: Won't Fix Bulk close of legacy issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira Administration GUI -- Key: NUTCH-251 URL: https://issues.apache.org/jira/browse/NUTCH-251 Project: Nutch Issue Type: Improvement Affects Versions: 0.8 Reporter: Stefan Groschupf Priority: Minor Attachments: Nutch-251-AdminGUI.tar.gz, hadoop_nutch_gui_v1.patch, nutch_gui_plugins_v1.zip, nutch_gui_v1.patch Having a web based administration interface would help to make nutch administration and management much more user friendly.
[jira] [Closed] (NUTCH-164) Locale (language) choice by first session has global effect to all sessions
[ https://issues.apache.org/jira/browse/NUTCH-164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-164. --- Resolution: Won't Fix Bulk close of legacy issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira Locale (language) choice by first session has global effect to all sessions --- Key: NUTCH-164 URL: https://issues.apache.org/jira/browse/NUTCH-164 Project: Nutch Issue Type: Bug Components: web gui Affects Versions: 0.7.1 Environment: any Reporter: KuroSaka TeruHiko Here's a report posted on nutch-users ML by Sergio [red...@redsun.homeip.net] on 1/02/2006: I just installed nutch in a Fedora Core 3 server. Once installed, I crawled a small site to test it. I opened my navigator (mozilla 1.7, which reports ES-ES locales by default, and everything was ok). Then I asked a friend of mine (the owner of the server) to test it. He did a search with an EN-US locale navigator, and the search page appeared in Spanish. After a few hours, I did the following: I restarted tomcat, I changed the locale of my mozilla to EN, and I opened the search page. Now I always get the English search page, even if I open it with a mozilla ES-ES locale. I wrote a message to my friend: nutch keeps the locale of the first navigator that makes a request for all other requests. For this reason, yesterday, as the first request was from my ES locale browser, you saw the page in Spanish with your browser that reports EN locale. There is a way to make this work: * Making sure that, after the server is restarted, the first request is done by a browser that reports EN locale. This happened in my environment too. After taking a look at the code, I believe this is caused by use of the default message bundle in search.jsp. The code snippet looks like: <i18n:bundle baseName="org.nutch.jsp.search"/> ... <title>Nutch: <i18n:message key="title"/></title> ... The default message bundle probably has application scope. Because of that, the first setting of the language has a global effect on every session created afterward. The right fix is to limit the scope to the session by inserting the scope specifier, as in: <i18n:bundle scope="session" baseName="org.nutch.jsp.search"/> Other JSP files need to be inspected for the same issue and should be fixed as well.
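Editor's sketch of the scoping problem described above, reduced to plain Java. This is not the i18n taglib's implementation; the class, the map standing in for an HTTP session, and the "locale" key are all assumptions made for illustration. It only shows why a value cached once per application sticks to the first requester, while a value cached per session does not.

```java
import java.util.Locale;
import java.util.Map;

final class LocaleScopeDemo {
    // Application scope: one slot shared by every request (the reported behaviour).
    private static Locale applicationLocale;

    static synchronized Locale applicationScoped(Locale requested) {
        if (applicationLocale == null) {
            applicationLocale = requested; // the first caller wins for everyone
        }
        return applicationLocale;
    }

    // Session scope: each session map (hypothetical stand-in for HttpSession
    // attributes) remembers only its own first choice.
    static Locale sessionScoped(Map<String, Object> session, Locale requested) {
        return (Locale) session.computeIfAbsent("locale", key -> requested);
    }
}
```

With the application-scoped variant, a Spanish browser making the first request fixes Spanish for the English browser that follows, which matches Sergio's observation.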
[jira] [Closed] (NUTCH-162) country code jp is used instead of language code ja for Japanese
[ https://issues.apache.org/jira/browse/NUTCH-162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-162. --- Resolution: Won't Fix Bulk close of legacy issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira country code jp is used instead of language code ja for Japanese Key: NUTCH-162 URL: https://issues.apache.org/jira/browse/NUTCH-162 Project: Nutch Issue Type: Bug Components: web gui Affects Versions: 0.7.1 Environment: n/a Reporter: KuroSaka TeruHiko Priority: Trivial Attachments: anchors_ja.properties, cached_ja.properties, explain_ja.properties, search_ja.properties, text_ja.properties In the locale-switching link for Japanese, "jp" is used as the language code, but it is an ISO country code. The language code "ja" should be used. By the way, I don't think many users are familiar with the ISO language codes. A Canadian user may click on "ca" without knowing that "ca" stands for Catalan, not Canadian English or French. Rather than listing the language codes, listing the language names in their respective languages may be better. (I say "may be" because the browser could show some language names as corrupted text if the current font does not support that language --- this is a difficult problem.)
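Editor's sketch of the distinction drawn above, using java.util.Locale. The helper class and method names are made up for illustration; the Locale calls are standard JDK API. nativeName shows one way to list each language under its own name, as the reporter suggests, with the caveat that the exact strings depend on the JDK's locale data.

```java
import java.util.Locale;

final class LanguageLabels {
    // ISO 639 language code, e.g. "ja" for Japanese.
    static String languageCode(Locale locale) {
        return locale.getLanguage();
    }

    // ISO 3166 country code, e.g. "JP" for Japan; empty if the locale has none.
    static String countryCode(Locale locale) {
        return locale.getCountry();
    }

    // The language's name written in that language itself.
    static String nativeName(Locale locale) {
        return locale.getDisplayLanguage(locale);
    }
}
```

For Locale.JAPAN the language code is "ja" and the country code is "JP", so a link labelled "jp" names the country rather than the language.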
[jira] [Closed] (NUTCH-441) Thai Analyzer Plugin
[ https://issues.apache.org/jira/browse/NUTCH-441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-441. --- Resolution: Won't Fix Thai Analyzer Plugin Key: NUTCH-441 URL: https://issues.apache.org/jira/browse/NUTCH-441 Project: Nutch Issue Type: New Feature Components: indexer Reporter: Vee Satayamas Attachments: nutch-plugin-analysis-th-20070207.patch.gz This Thai analyzer plugin was created by copying and modifying the French analyzer plugin. However, there is no Thai analyzer in lucene-analyzers-2.0.0.jar (in lib-lucene-analyzers), so lucene-analyzers-nightly.jar was used instead.
[jira] [Closed] (NUTCH-224) Nutch doesn't handle Korean text at all
[ https://issues.apache.org/jira/browse/NUTCH-224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-224. --- Resolution: Won't Fix Nutch doesn't handle Korean text at all --- Key: NUTCH-224 URL: https://issues.apache.org/jira/browse/NUTCH-224 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 0.7.1 Reporter: KuroSaka TeruHiko I was browsing NutchAnalysis.jj and found that Hangul Syllables (U+AC00 ... U+D7AF; the U+ prefix denotes a Unicode character by its hex value) are not part of the LETTER or CJK class. It seems to me that Nutch cannot handle Korean documents at all. I posted the above message at nutch-user ML and Cheolgoo Kang [app...@gmail.com] replied: There was a similar issue with Lucene's StandardTokenizer.jj. http://issues.apache.org/jira/browse/LUCENE-444 and http://issues.apache.org/jira/browse/LUCENE-461 I have almost no experience with Nutch, but you can handle it like those issues above. Both fixes should probably be ported back to NutchAnalysis.jj.
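Editor's sketch of the range check implied above. It is plain Java rather than JavaCC grammar, and the class and method names are hypothetical; only the U+AC00 to U+D7AF range comes from the report. A tokenizer that omits this range from its letter classes would see no letters in Korean text.

```java
final class HangulCheck {
    // Hangul Syllables block as quoted in the report: U+AC00 through U+D7AF.
    static boolean isHangulSyllable(int codePoint) {
        return codePoint >= 0xAC00 && codePoint <= 0xD7AF;
    }

    // True if any code point of the text falls in the Hangul Syllables block.
    static boolean containsHangul(String text) {
        return text.codePoints().anyMatch(HangulCheck::isHangulSyllable);
    }
}
```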
[jira] [Closed] (NUTCH-568) Indexer does not update the Lucene TITLE field
[ https://issues.apache.org/jira/browse/NUTCH-568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-568. --- Resolution: Won't Fix Indexer does not update the Lucene TITLE field Key: NUTCH-568 URL: https://issues.apache.org/jira/browse/NUTCH-568 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 1.0.0 Environment: Windows XP Reporter: smorales Attachments: RN-071018-24.html Hi, The indexer is unable to update the TITLE field of the Lucene index when processing specific html documents. This issue has been reproduced using Nutch-Nightly Build #241 (Oct 19, 2007 4:01:28 AM). The problem does not occur using NUTCH 9.0. Workflow: 1.- Extract the package and copy across the following configuration files from NUTCH 9.0 - {nutch_home_9.0}/bin/url folder, containing the urls - {nutch_home_9.0}/conf/nutch-site.xml - {nutch_home_9.0}/conf/crawl-urlfilter.txt 2.- To reproduce the issue, copy the attached html document to your webserver/filesystem. 3.- Run the crawl. For example: ./nutch crawl urls -dir crawl -depth 22 4.- Open the index using Luke. For this test, I used lukeall-0.7.1.jar 5.- Select the document tab and move through the docs until you find our html document. You will see that the TITLE field is empty -- INCORRECT, because this html document contains a title. 6.- Now open the html document, add a space anywhere, then save it again. 7.- Repeat steps 3 and 4. You will notice that this time the TITLE field contains the correct information. Please advise. Many thanks in advance for your support. Sergio
[jira] [Closed] (NUTCH-249) black- white list url filtering
[ https://issues.apache.org/jira/browse/NUTCH-249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-249. --- Resolution: Won't Fix Bulk close of legacy issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira black- white list url filtering --- Key: NUTCH-249 URL: https://issues.apache.org/jira/browse/NUTCH-249 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 0.8 Reporter: Stefan Groschupf Assignee: Dennis Kubes Priority: Trivial Attachments: blackWhiteListV2.patch, blackWhiteListV3.patch, bw.patch Existing url filter mechanisms need to process each url against each filter pattern. For very large filter sets this may not scale well.
[jira] [Closed] (NUTCH-709) JSParseFilter gets into an infinate loop and ets all the stack
[ https://issues.apache.org/jira/browse/NUTCH-709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-709. --- Resolution: Won't Fix Bulk close of legacy issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira JSParseFilter gets into an infinate loop and ets all the stack --- Key: NUTCH-709 URL: https://issues.apache.org/jira/browse/NUTCH-709 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.0.0 Environment: Hadoop 0.19.0 running nutch trunk Reporter: Tim Hawkins Attachments: JSParseFilter.error.patch When crawling pages with seperate fetch and parse, I see processes die becuase of stack overflow. Output is generaly. java.lang.StackOverflowError at org.apache.nutch.parse.js.JSParseFilter.walk(JSParseFilter.java:146) at org.apache.nutch.parse.js.JSParseFilter.walk(JSParseFilter.java:148) at org.apache.nutch.parse.js.JSParseFilter.walk(JSParseFilter.java:148) at org.apache.nutch.parse.js.JSParseFilter.walk(JSParseFilter.java:148) at org.apache.nutch.parse.js.JSParseFilter.walk(JSParseFilter.java:148) at org.apache.nutch.parse.js.JSParseFilter.walk(JSParseFilter.java:148) at org.apache.nutch.parse.js.JSParseFilter.walk(JSParseFilter.java:148) at org.apache.nutch.parse.js.JSParseFilter.walk(JSParseFilter.java:148) at org.apache.nutch.parse.js.JSParseFilter.walk(JSParseFilter.java:148) at org.apache.nutch.parse.js.JSParseFilter.walk(JSParseFilter.java:148) at org.apache.nutch.parse.js.JSParseFilter.walk(JSParseFilter.java:148) at org.apache.nutch.parse.js.JSParseFilter.walk(JSParseFilter.java:148) at org.apache.nutch.parse.js.JSParseFilter.walk(JSParseFilter.java:148) at org.apache.nutch.parse.js.JSParseFilter.walk(JSParseFilter.java:148) at org.apache.nutch.parse.js.JSParseFilter.walk(JSParseFilter.java:148) at org.apache.nutch.parse.js.JSParseFilter.walk(JSParseFilter.java:148) at org.apache.nutch.parse.js.JSParseFilter.walk(JSParseFilter.java:148) at 
org.apache.nutch.parse.js.JSParseFilter.walk(JSParseFilter.java:148) at org.apache.nutch.parse.js.JSParseFilter.walk(JSParseFilter.java:148) Inspection of the code shows that this is a recursive call to walk(.) -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
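Editor's sketch of one way to avoid this kind of overflow. The report does not say whether the recursion is caused by a very deep tree or by a true cycle, so this is an assumption-laden illustration and not the attached patch: the TreeNode class is a hypothetical stand-in for a DOM node, the walk uses an explicit stack instead of recursion, and an identity-based visited set guards against revisiting nodes.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

final class IterativeWalk {
    // Hypothetical stand-in for a DOM node; only the child list matters here.
    static final class TreeNode {
        final List<TreeNode> children = new ArrayList<>();
    }

    // Visits every reachable node once without using the call stack.
    static int countNodes(TreeNode root) {
        Deque<TreeNode> pending = new ArrayDeque<>();
        Set<TreeNode> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        pending.push(root);
        int visited = 0;
        while (!pending.isEmpty()) {
            TreeNode node = pending.pop();
            if (!seen.add(node)) {
                continue; // already visited: protects against cycles
            }
            visited++;
            for (TreeNode child : node.children) {
                pending.push(child);
            }
        }
        return visited;
    }
}
```

The trade-off is heap memory for the pending stack and visited set in exchange for a walk whose depth is no longer limited by the thread's stack size.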
[jira] [Closed] (NUTCH-289) CrawlDatum should store IP address
[ https://issues.apache.org/jira/browse/NUTCH-289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-289. --- Resolution: Won't Fix Bulk close of legacy issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira CrawlDatum should store IP address -- Key: NUTCH-289 URL: https://issues.apache.org/jira/browse/NUTCH-289 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 0.8 Reporter: Doug Cutting Attachments: ipInCrawlDatumDraftV1.patch, ipInCrawlDatumDraftV4.patch, ipInCrawlDatumDraftV5.1.patch, ipInCrawlDatumDraftV5.patch If the CrawlDatum stored the IP address of the host of its URL, then one could: - partition fetch lists on the basis of IP address, for better politeness; - truncate pages to fetch per IP address, rather than just hostname. This would be a good way to limit the impact of domain spammers. The IP addresses could be resolved when a CrawlDatum is first created for a new outlink, or perhaps during CrawlDB update.
[jira] [Closed] (NUTCH-496) ConcurrentModificationException can be thrown when getSorted() is called.
[ https://issues.apache.org/jira/browse/NUTCH-496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-496. --- Resolution: Won't Fix Bulk close of legacy issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira ConcurrentModificationException can be thrown when getSorted() is called. - Key: NUTCH-496 URL: https://issues.apache.org/jira/browse/NUTCH-496 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 0.9.0 Environment: Nutch application, during fetch. Reporter: Briggs Attachments: language_analyzer_ngram.patch, nutch-496.txt NGramProfile (in the org.apache.nutch.analysis.lang package) is not thread-safe: a ConcurrentModificationException can occur if, while one thread is iterating over the List returned by getSorted(), another thread calls getSorted() again.
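Editor's sketch of a common remedy for this pattern. It is not the attached patch and does not reproduce NGramProfile's internals; the class, its fields, and the ordering rule are assumptions. The idea is that getSorted() builds and returns a fresh list under a lock on every call, so no caller ever iterates a list that a later call can modify.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SortedSnapshot {
    private final Map<String, Integer> counts = new HashMap<>();

    synchronized void add(String ngram) {
        counts.merge(ngram, 1, Integer::sum);
    }

    // Returns a new list per call: most frequent first, ties broken alphabetically.
    // The caller owns the returned list, so later calls cannot disturb its iteration.
    synchronized List<String> getSorted() {
        List<String> keys = new ArrayList<>(counts.keySet());
        keys.sort(Comparator.comparing((String k) -> counts.get(k))
                .reversed()
                .thenComparing(Comparator.<String>naturalOrder()));
        return keys;
    }
}
```

Copying on every call costs an allocation and a sort each time; caching an immutable snapshot would be an alternative if getSorted() is called very often.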
[jira] [Closed] (NUTCH-424) NekoHTML's DOMFragmentParser hangs on certain URLs (CLONE: Problem persists with Nutch 0.9 and 0.8.1 (Nekohtml 0.9.4))
[ https://issues.apache.org/jira/browse/NUTCH-424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-424. --- Resolution: Won't Fix Bulk close of legacy issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira NekoHTML's DOMFragmentParser hangs on certain URLs (CLONE: Problem persists with Nutch 0.9 and 0.8.1 (Nekohtml 0.9.4)) -- Key: NUTCH-424 URL: https://issues.apache.org/jira/browse/NUTCH-424 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 0.8.1, 0.9.0 Environment: Linux and Windows Reporter: Karsten Dello I've tracked down occasional fetcher hangs to NekoHTML's DOMFragmentParser hanging certain HTML documents, for example, http://www.inlandrevenue.gov.uk/charities/chapter_3.htm. The thread dump on the hung parser is: CompilerThread0 daemon prio=1 tid=0x080c4c18 nid=0x47da waiting on condition [0x..0x8a3daf68] Signal Dispatcher daemon prio=1 tid=0x080c3d60 nid=0x47d9 waiting on condition [0x..0x] Finalizer daemon prio=1 tid=0x080b8818 nid=0x47d8 in Object.wait() [0x8a2a..0x8a2a0680] at java.lang.Object.wait(Native Method) - waiting on 0x4a60d058 (a java.lang.ref.ReferenceQueue$Lock) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116) - locked 0x4a60d058 (a java.lang.ref.ReferenceQueue$Lock) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132) at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159) Reference Handler daemon prio=1 tid=0x080b7b50 nid=0x47d7 in Object.wait() [0x8a21f000..0x8a21f800] at java.lang.Object.wait(Native Method) - waiting on 0x4a60d0d8 (a java.lang.ref.Reference$Lock) at java.lang.Object.wait(Object.java:474) at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116) - locked 0x4a60d0d8 (a java.lang.ref.Reference$Lock) main prio=1 tid=0x0805c170 nid=0x47d1 waiting on condition [0xbfffc000..0xbfffcec8] at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:99) 
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:393) at java.lang.StringBuffer.append(StringBuffer.java:225) - locked 0x45910118 (a java.lang.StringBuffer) at org.apache.xerces.dom.CharacterDataImpl.appendData(Unknown Source) at org.cyberneko.html.parsers.DOMFragmentParser.characters(Unknown Source) at org.cyberneko.html.filters.DefaultFilter.characters(Unknown Source) at org.cyberneko.html.HTMLTagBalancer.characters(Unknown Source) at org.cyberneko.html.HTMLScanner$ContentScanner.scanCharacters(Unknown Source) at org.cyberneko.html.HTMLScanner$ContentScanner.scan(Unknown Source) at org.cyberneko.html.HTMLScanner.scanDocument(Unknown Source) at org.cyberneko.html.HTMLConfiguration.parse(Unknown Source) at org.cyberneko.html.HTMLConfiguration.parse(Unknown Source) at org.cyberneko.html.parsers.DOMFragmentParser.parse(Unknown Source) at net.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:157) at net.nutch.parse.ParserChecker.main(ParserChecker.java:74) VM Thread prio=1 tid=0x080b4f30 nid=0x47d6 runnable VM Periodic Task Thread prio=1 tid=0x080c75f8 nid=0x47dc waiting on condition Using the URL mentioned above, I was able to successfully parse the file using a normal NekoHTML DocumentParser. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Closed] (NUTCH-119) Regexp to extract outlinks incorrect
[ https://issues.apache.org/jira/browse/NUTCH-119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-119. --- Resolution: Won't Fix Bulk close of legacy issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira Regexp to extract outlinks incorrect Key: NUTCH-119 URL: https://issues.apache.org/jira/browse/NUTCH-119 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 0.7.1, 0.7.2, 0.8 Reporter: Sébastien Le Callonnec Attachments: TestPattern.java, TestPattern.java The regexp which extracts outlinks is incorrect. It extracts in-line CSS styles, and leaves out links such as <a href="/sitemap.html">browse</a>. This has been reported by Earl Cahill on the user list.
[jira] [Closed] (NUTCH-414) parse-mp3 plugin concatenating previous tags for text field
[ https://issues.apache.org/jira/browse/NUTCH-414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-414. --- Resolution: Won't Fix Bulk close of legacy issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira parse-mp3 plugin concatenating previous tags for text field --- Key: NUTCH-414 URL: https://issues.apache.org/jira/browse/NUTCH-414 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 0.9.0 Environment: - Reporter: Brian Whitman The parse-mp3 plugin seems to be keeping the text content of previous parses. For every new mp3 file parsed, it puts the contents of all the previous text fields into the plain text field for that file. You can see this by fetching a set of mp3s in one segment, then viewing their plain text in the nutch webapp. The plain text will include the contents of all files fetched in that round, which makes searching fruitless. I made a tiny band-aid change to MP3Parser.java and MetadataCollector.java against the nightly. It seems to fix the problem.
--- MP3Parser.java 2006-12-10 09:43:26.0 -0500
+++ MP3Parser.java.new 2006-12-10 16:37:03.0 -0500
@@ -67,7 +67,7 @@
 fos.write(raw);
 fos.close();
 MP3File mp3 = new MP3File(tmp);
-
+ metadataCollector.clearText();
 if (mp3.hasID3v2Tag()) {
 parse = getID3v2Parse(mp3, content.getMetadata());
 } else if (mp3.hasID3v1Tag()) {
--- MetadataCollector.java 2006-12-10 09:43:26.0 -0500
+++ MetadataCollector.java.new 2006-12-10 16:37:28.0 -0500
@@ -42,6 +42,10 @@
 this.conf = conf;
 }

+ public void clearText() {
+ text = "";
+ }
+
 public void notifyProperty(String name, String value) throws MalformedURLException {
 if (name.equals("TIT2-Text"))
 setTitle(value);
[jira] [Closed] (NUTCH-113) Disable permanent DNS-to-IP caching for JVM 1.4
[ https://issues.apache.org/jira/browse/NUTCH-113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-113. --- Resolution: Won't Fix Bulk close of legacy issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira Disable permanent DNS-to-IP caching for JVM 1.4 --- Key: NUTCH-113 URL: https://issues.apache.org/jira/browse/NUTCH-113 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 0.7.2, 0.8 Reporter: Fuad Efendi Priority: Trivial DNS-to-IP mappings may change during long crawls, but by default JVM 1.4 caches them forever. Some related discussions at Jakarta-HttpClient-User: http://mail-archives.apache.org/mod_mbox/jakarta-httpclient-user/200506.mbox/%3c20050627022440.SVIL13442.lakermmtao05.cox.net@zeus%3e http://java.sun.com/j2se/1.4.2/docs/guide/net/properties.html networkaddress.cache.ttl (default: -1) Specified in java.security to indicate the caching policy for successful name lookups from the name service. The value is specified as an integer to indicate the number of seconds to cache the successful lookup. A value of -1 indicates cache forever. We probably need this code in org.apache.nutch.fetcher.Fetcher: private static final int FETCHER_DNS_TTL_MINUTES = NutchConf.get().getInt("fetcher.dns.ttl.minutes", 120); static { java.security.Security.setProperty("networkaddress.cache.ttl", "" + FETCHER_DNS_TTL_MINUTES*60); } And a new property in nutch-default.xml: <property> <name>fetcher.dns.ttl.minutes</name> <value>120</value> <description>DNS-to-IP cache, Time-to-Live</description> </property>
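Editor's sketch of the proposal above as a standalone class. The NutchConf lookup is replaced by a plain method parameter so the example compiles without Nutch on the classpath; the class and method names are hypothetical. java.security.Security.setProperty and the networkaddress.cache.ttl property are standard JDK API. As far as I know the JVM reads this property when its address cache policy is initialised, so the call should run before the first host name lookup.

```java
import java.security.Security;

final class DnsCacheTtl {
    // Sets the positive DNS cache lifetime. The JDK property is in seconds,
    // while the proposed Nutch setting is in minutes, hence the conversion.
    static void apply(int ttlMinutes) {
        if (ttlMinutes < 0) {
            throw new IllegalArgumentException("ttlMinutes must not be negative");
        }
        Security.setProperty("networkaddress.cache.ttl", Integer.toString(ttlMinutes * 60));
    }
}
```

Calling DnsCacheTtl.apply(120) early in startup corresponds to the 120-minute default suggested in the report.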
[jira] [Closed] (NUTCH-87) Efficient site-specific crawling for a large number of sites
[ https://issues.apache.org/jira/browse/NUTCH-87?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-87. -- Resolution: Won't Fix Bulk close of legacy issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira Efficient site-specific crawling for a large number of sites Key: NUTCH-87 URL: https://issues.apache.org/jira/browse/NUTCH-87 Project: Nutch Issue Type: New Feature Components: fetcher Affects Versions: 0.7.2, 0.8 Environment: cross-platform Reporter: AJ Chen Attachments: JIRA-87-whitelistfilter.tar.gz, build.xml.patch, build.xml.patch-0.8, urlfilter-whitelist.tar.gz There is a gap between whole-web crawling and crawling a single site (or a handful of sites). Many applications actually fall in this gap: they usually need to crawl a large number of selected sites, say 10 domains. The current CrawlTool is designed for a handful of sites. So this request calls for a new feature in, or an improvement to, CrawlTool so that the nutch crawl command can deal efficiently with a large number of sites. One requirement is to add or change the smallest amount of code, so that this feature can be implemented sooner rather than later. There is a discussion about adding a URLFilter to implement the requested feature; see the following thread - http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00726.html The idea is to use a hashtable in the URLFilter to look up the regex for any given domain. A hashtable will be much faster than the list implementation currently used in RegexURLFilter. Fortunately, Matt Kangas has implemented this idea before for his own application and is willing to make it available for adaptation to Nutch. I'll be happy to help him in this regard. But before we do it, we would like to hear more discussion or comments about this approach or other approaches. In particular, let us know what the potential downsides of a hashtable lookup in a new URLFilter plugin would be. AJ Chen
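Editor's sketch of the hashtable idea described above. This is not Matt Kangas's implementation or the attached plugin, and it does not implement Nutch's URLFilter interface; the class name, the allow() method, and the accept-or-null return convention are assumptions made for illustration. Each URL costs one hash lookup by host plus one regex match, instead of a scan over every pattern.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

final class DomainWhitelistFilter {
    // One compiled pattern per whitelisted host.
    private final Map<String, Pattern> rulesByHost = new HashMap<>();

    void allow(String host, String regex) {
        rulesByHost.put(host.toLowerCase(Locale.ROOT), Pattern.compile(regex));
    }

    // Returns the URL if its host is whitelisted and its pattern matches, else null.
    String filter(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return null; // unparsable URL: reject
        }
        if (host == null) {
            return null;
        }
        Pattern rule = rulesByHost.get(host.toLowerCase(Locale.ROOT));
        return (rule != null && rule.matcher(url).find()) ? url : null;
    }
}
```

One downside worth discussing, in line with the question raised above: an exact-host lookup does not cover subdomains unless each one is registered or the key is reduced to a registered domain first.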
[jira] [Closed] (NUTCH-460) RDF parser plugin
[ https://issues.apache.org/jira/browse/NUTCH-460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-460. --- Resolution: Won't Fix Bulk close of legacy issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira RDF parser plugin - Key: NUTCH-460 URL: https://issues.apache.org/jira/browse/NUTCH-460 Project: Nutch Issue Type: New Feature Components: fetcher Affects Versions: 0.9.0 Reporter: Ricardo J. Méndez Attachments: rubyspider-rdf.zip I've written a couple plugins that I'd like to contribute. RDFLinkParseFilter looks for links on the pages that point towards RDF information, and tags the pages with metadata about the type of links they hold. RDFLinkIndexingFilter indexes said metadata. RDFParser parses RDF information from several possible formats using Jena, and extracts the links that the file points to as Outlinks so that they can be fetched as well.
[jira] [Closed] (NUTCH-182) Log when db.max configuration limits reached
[ https://issues.apache.org/jira/browse/NUTCH-182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-182. --- Resolution: Won't Fix Bulk close of legacy issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira Log when db.max configuration limits reached Key: NUTCH-182 URL: https://issues.apache.org/jira/browse/NUTCH-182 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 0.8 Reporter: Matt Kangas Priority: Trivial Attachments: LinkDb.java.patch, ParseData.java.patch Followup to http://www.nabble.com/Re%3A-Can%27t-index-some-pages-p2480833.html There are three db.max parameters currently in nutch-default.xml: * db.max.outlinks.per.page * db.max.anchor.length * db.max.inlinks Having values that are too low can result in a site being under-crawled. However, currently there is nothing written to the log when these limits are hit, so users have to guess when they need to raise these values. I suggest that we add three new log messages at the appropriate points: * Exceeded db.max.outlinks.per.page for URL * Exceeded db.max.anchor.length for URL * Exceeded db.max.inlinks for URL
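Editor's sketch of the logging suggested above. It is not the attached LinkDb or ParseData patch; the class, the generic truncate helper, and the use of java.util.logging are assumptions. Only the property names and the "Exceeded ... for URL" message shape come from the report.

```java
import java.util.List;
import java.util.logging.Logger;

final class LimitLogging {
    private static final Logger LOG = Logger.getLogger(LimitLogging.class.getName());

    // Truncates items to max entries and logs once when the limit is actually hit,
    // naming the configuration property so users know which value to raise.
    static <T> List<T> truncate(List<T> items, int max, String property, String url) {
        if (items.size() <= max) {
            return items;
        }
        LOG.info("Exceeded " + property + " for URL " + url
                + " (" + items.size() + " > " + max + ")");
        return items.subList(0, max);
    }
}
```

A call such as truncate(outlinks, maxOutlinks, "db.max.outlinks.per.page", pageUrl) would cover the first of the three proposed messages; the other two would follow the same pattern.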
[jira] [Closed] (NUTCH-826) Mailing list is broken.
[ https://issues.apache.org/jira/browse/NUTCH-826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma closed NUTCH-826.
---

Bulk close of resolved issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

Mailing list is broken.
---
Key: NUTCH-826
URL: https://issues.apache.org/jira/browse/NUTCH-826
Project: Nutch
Issue Type: Bug
Reporter: John Sherwood
Assignee: Julien Nioche
Priority: Blocker
Fix For: 1.1

All of the following addresses are failing:
nutch-u...@nutch.apache.org
nutch-user-subscr...@nutch.apache.org
nutch-user-subscr...@lucene.apache.org

For the last one, the mailer daemon said "This mailing list has moved to user at nutch.apache.org."

Below is the message I tried to send:

Hi people,

I've been banging my head against this problem for two days now. Simply, I want to add a field with the value of a given meta tag. I've been trying the parse-xml plugin, but it seems that it doesn't work with version 1.0. I've tried the code at http://sujitpal.blogspot.com/2009/07/nutch-getting-my-feet-wet.html and it hasn't worked. I don't even know why. I don't even know if my plugin is being used... or even looked for! Nutch seems to have an infuriating "fail silently" policy for plugins. I put a System.exit(1) in my filters just to see if my code is even being encountered. It has not, in spite of my config telling it to.

Here's my config:

nutch-site.xml
...
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|metadata</value>
</property>
...

parse-plugins.xml
...
<mimeType name="application/xhtml+xml">
  <plugin id="parse-html" />
  <plugin id="metadata" />
</mimeType>
<mimeType name="text/html">
  <plugin id="parse-html" />
  <plugin id="metadata" />
</mimeType>
<mimeType name="text/sgml">
  <plugin id="parse-html" />
  <plugin id="metadata" />
</mimeType>
<mimeType name="text/xml">
  <plugin id="parse-html" />
  <plugin id="parse-rss" />
  <plugin id="metadata" />
  <plugin id="feed" />
</mimeType>
...
<alias name="metadata" extension-id="com.example.website.nutch.parsing.MetaTagExtractorParseFilter" />
...

I've also copied the plugin.xml and jar from my build/metadata to the plugins root dir.

Nonetheless, Nutch runs and puts data in Solr for me. Afaik, Nutch is completely unaware of my plugin despite my config options. Is there some other place I need to tell Nutch to use my plugin? Is there some other approach to do this without having to write a plugin? This does seem like a lot of work to simply get a meta tag into a field. Any help would be appreciated.

Sincerely,
John Sherwood
[jira] [Closed] (NUTCH-570) Improvement of URL Ordering in Generator.java
[ https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-570. --- Bulk close of resolved issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira Improvement of URL Ordering in Generator.java - Key: NUTCH-570 URL: https://issues.apache.org/jira/browse/NUTCH-570 Project: Nutch Issue Type: Improvement Components: generator Reporter: Ned Rockson Assignee: Otis Gospodnetic Priority: Minor Attachments: GeneratorDiff.out, GeneratorDiff_v1.out [Copied directly from my email to nutch-dev list] Recently I switched to Fetcher2 over Fetcher for larger whole web fetches (50-100M at a time). I found that the URLs generated are not optimal because they are simply randomized by a hash comparator. In one crawl on 24 machines it took about 3 days to crawl 30M URLs. In comparison with old benchmarks I had set with regular Fetcher.java this was at least 3 fold more time. Anyway, I realized that the best situation for ordering can be approached by randomization, but in order to get optimal ordering, urls from the same host should be as far apart in the list as possible. So I wrote a series of 2 map/reduces to optimize the ordering and for a list of 25M documents it takes about 10 minutes on our cluster. Right now I have it in its own class, but I figured it can go in Generator.java and just add a flag in nutch-default.xml determining if the user wants to use it. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
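To make the ordering goal in NUTCH-570 concrete, the following single-process Java sketch spreads URLs from the same host apart by round-robin interleaving over per-host queues. This is a simpler stand-in, not the reporter's pair of map/reduce jobs and not the attached GeneratorDiff patches; HostInterleaver is a hypothetical name, and round-robin only approximates "as far apart as possible" when hosts have very different URL counts.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch, not Nutch code: interleaves URLs so same-host URLs are spread out. */
final class HostInterleaver {
  static List<String> interleave(List<String> urls) {
    // Group URLs by host, keeping first-seen host order.
    Map<String, Deque<String>> byHost = new LinkedHashMap<>();
    for (String url : urls) {
      String host = URI.create(url).getHost();
      byHost.computeIfAbsent(host == null ? "" : host, h -> new ArrayDeque<>()).add(url);
    }
    // Take one URL per host per round until every queue is drained.
    List<String> ordered = new ArrayList<>(urls.size());
    while (!byHost.isEmpty()) {
      Iterator<Deque<String>> it = byHost.values().iterator();
      while (it.hasNext()) {
        Deque<String> queue = it.next();
        ordered.add(queue.poll());
        if (queue.isEmpty()) {
          it.remove(); // host exhausted, drop it from later rounds
        }
      }
    }
    return ordered;
  }
}
```

With two hosts, consecutive entries alternate between them until the smaller host runs out, which avoids hitting one host with back-to-back requests.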
[jira] [Closed] (NUTCH-742) Checksum Error
[ https://issues.apache.org/jira/browse/NUTCH-742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma closed NUTCH-742.
---

Bulk close of resolved issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

Checksum Error
---
Key: NUTCH-742
URL: https://issues.apache.org/jira/browse/NUTCH-742
Project: Nutch
Issue Type: Bug
Components: indexer
Affects Versions: 1.0.0
Environment: linux ubuntu8.0.4 64bit 10datanode 4G of memory per node
Reporter: mawanqiang

An error occurs in Nutch 1.0 when building an index from approximately 1 million records. The error is:

java.lang.RuntimeException: problem advancing post rec#6758513
  at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:883)
  at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:237)
  at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:233)
  at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:79)
  at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50)
  at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
  at org.apache.hadoop.mapred.Child.main(Child.java:158)
Caused by: org.apache.hadoop.fs.ChecksumException: Checksum Error
  at org.apache.hadoop.mapred.IFileInputStream.doRead(IFileInputStream.java:153)
  at org.apache.hadoop.mapred.IFileInputStream.read(IFileInputStream.java:90)
  at org.apache.hadoop.mapred.IFile$Reader.readData(IFile.java:301)
  at org.apache.hadoop.mapred.IFile$Reader.rejigData(IFile.java:331)
  at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:315)
  at org.apache.hadoop.mapred.IFile$Reader.next(IFile.java:377)
  at org.apache.hadoop.mapred.Merger$Segment.next(Merger.java:174)
  at org.apache.hadoop.mapred.Merger$MergeQueue.adjustPriorityQueue(Merger.java:277)
  at org.apache.hadoop.mapred.Merger$MergeQueue.next(Merger.java:297)
  at org.apache.hadoop.mapred.Task$ValuesIterator.readNextKey(Task.java:922)
  at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:881)
  ... 6 more
[jira] [Closed] (NUTCH-44) too many search results
[ https://issues.apache.org/jira/browse/NUTCH-44?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma closed NUTCH-44.
--

Bulk close of resolved issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

too many search results
---
Key: NUTCH-44
URL: https://issues.apache.org/jira/browse/NUTCH-44
Project: Nutch
Issue Type: Bug
Components: web gui
Environment: web environment
Reporter: Emilijan Mirceski
Assignee: Dennis Kubes
Attachments: NUTCH-44-2-20080215.patch, NUTCH-44.patch

There should be a (user-defined) limit on the number of results the search engine can return.

For example, if one modifies the search URL as:

http://my/search.jsp?query=some query&hitsPerPage=2&hitsPerSite=0

The search will try to return 20,000 pages, which isn't good for server-side performance.

Is it possible to have a setting in the config xml files to control this?

Thanks,
Emilijan
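One way to picture the limit requested in NUTCH-44 is a small clamp applied to the hitsPerPage request parameter before the search runs. The Java sketch below is illustrative only and is not the attached patch: HitsLimit, clampHitsPerPage, and the idea of reading the maximum from a configuration value are all assumptions.

```java
/** Hypothetical sketch, not Nutch code: bounds a user-supplied hitsPerPage value. */
final class HitsLimit {
  /**
   * Parses the raw request parameter and keeps it within [1, maxHits].
   * Missing, non-numeric, or non-positive values fall back to defaultHits.
   */
  static int clampHitsPerPage(String requested, int defaultHits, int maxHits) {
    int hits;
    try {
      hits = Integer.parseInt(requested); // throws NumberFormatException for null too
    } catch (NumberFormatException e) {
      return defaultHits;
    }
    if (hits < 1) {
      return defaultHits;
    }
    return Math.min(hits, maxHits);
  }
}
```

With a configured maximum of 100, a request for 20000 hits per page would be served as 100, while ordinary values pass through unchanged.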
[jira] [Closed] (NUTCH-854) Define standard attributes with values and explaination to configuration files in conf directory
[ https://issues.apache.org/jira/browse/NUTCH-854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma closed NUTCH-854.
---

Bulk close of resolved issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

Define standard attributes with values and explaination to configuration files in conf directory
Key: NUTCH-854
URL: https://issues.apache.org/jira/browse/NUTCH-854
Project: Nutch
Issue Type: Improvement
Environment: Windows XP SP3, Cygwin, JDK 1.6.20, Ant 1.8.1
Reporter: Pham Tuan Minh
Fix For: 2.0

Nutch would be easier to use if every configuration file in the conf directory defined standard attributes with values and explanations.

For example, nutch-site.xml.template currently contains no attributes and no explanation; we should define them.
-
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- site-specific property overrides in this file. -->
<configuration>
  <!-- Agent name -->
  <property>
    <name>http.agent.name</name>
    <value>nutch-solr-integration</value>
  </property>
  !
  <property>
    <name>generate.max.per.host</name>
    <value>100</value>
  </property>
  <property>
    <!-- plug-in using in this site -->
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-tika|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
-
Thanks,
[jira] [Closed] (NUTCH-958) Httpclient scheme priority order fix
[ https://issues.apache.org/jira/browse/NUTCH-958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-958. --- Bulk close of resolved issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira Httpclient scheme priority order fix Key: NUTCH-958 URL: https://issues.apache.org/jira/browse/NUTCH-958 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.3 Reporter: Claudio Martella Fix For: 1.3 Attachments: httpclient.diff Httpclient will try to authenticate in this order by default: ntlm, digest, basic. If you set as default a scheme that comes in this list after a scheme that is negotiated by the server, and this authentication fails, the default scheme will not be tried. I.e. if you set digest as default scheme but the server negotiates ntlm, the client will still try ntlm and fail. The fix sets the default scheme as the only possible scheme for authentication for the given realm by setting the authentication priorities of httpclient. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Closed] (NUTCH-866) STOP Nutch without breaking the crawled data
[ https://issues.apache.org/jira/browse/NUTCH-866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma closed NUTCH-866.
---

Bulk close of resolved issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

STOP Nutch without breaking the crawled data
Key: NUTCH-866
URL: https://issues.apache.org/jira/browse/NUTCH-866
Project: Nutch
Issue Type: New Feature
Reporter: Pham Tuan Minh
Fix For: 2.0

How can we stop a running Nutch instance, in local mode and in reducer mode, without breaking the crawled data?

For example, you submit a list of sites that takes a long time to crawl completely, and then you want to stop the Nutch instance suddenly ...

- For local mode, I suggest the following: we create a stop.txt file in a specific directory; the Nutch instance periodically checks whether this file exists and, if it does, stops itself normally.
- For reducer mode, could we use ZooKeeper to keep the state of each instance?

Any other suggestions?

Thanks,
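The stop-file idea for local mode in NUTCH-866 can be sketched in a few lines of Java. This is only an illustration of the polling pattern: StopFileMonitor is a hypothetical class, the "units of work" stand in for whatever steps a crawl is divided into, and nothing here reflects how Nutch actually schedules its jobs.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/** Hypothetical sketch, not Nutch code: stops between work units when a stop file appears. */
final class StopFileMonitor {
  private final Path stopFile;

  StopFileMonitor(Path stopFile) {
    this.stopFile = stopFile;
  }

  /** True once someone has created the stop file (for example stop.txt). */
  boolean stopRequested() {
    return Files.exists(stopFile);
  }

  /**
   * Runs the units in order and checks for the stop file before each one,
   * so a unit is never interrupted halfway. Returns how many units completed.
   */
  int run(List<Runnable> units) {
    int completed = 0;
    for (Runnable unit : units) {
      if (stopRequested()) {
        break; // stop cleanly between units
      }
      unit.run();
      completed++;
    }
    return completed;
  }
}
```

Checking only between units is the design choice that keeps already written data consistent: the instance finishes what it started and simply declines to begin the next step.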
[jira] [Closed] (NUTCH-86) LanguageIdentifier API enhancements
[ https://issues.apache.org/jira/browse/NUTCH-86?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma closed NUTCH-86.
--

Bulk close of resolved issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

LanguageIdentifier API enhancements
---
Key: NUTCH-86
URL: https://issues.apache.org/jira/browse/NUTCH-86
Project: Nutch
Issue Type: Improvement
Components: indexer
Affects Versions: 0.6, 0.7, 0.8
Reporter: Jerome Charron
Assignee: Chris A. Mattmann
Priority: Minor
Fix For: 1.2, 2.0

More information can be found in the following thread on the Nutch-Dev mailing list:
http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00569.html

Summary:
1. LanguageIdentifier API changes. The similarity methods should return an ordered array of language-code/score pairs instead of a simple String containing the language-code.
2. Ensure consistency between LanguageIdentifier scoring and NGramProfile.getSimilarity().
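The first point of NUTCH-86, returning an ordered array of language-code/score pairs rather than a single String, might look roughly like the Java sketch below. LanguageScores, LangScore, and rank are hypothetical names chosen for illustration; the sketch assumes a higher score means a better match and does not reproduce the actual LanguageIdentifier API.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Map;

/** Hypothetical sketch, not the Nutch API: ranks candidate languages by score. */
final class LanguageScores {
  /** One language-code/score pair. */
  static final class LangScore {
    final String lang;
    final double score;

    LangScore(String lang, double score) {
      this.lang = lang;
      this.score = score;
    }
  }

  /** Returns the pairs ordered from highest to lowest score (assumes higher is better). */
  static LangScore[] rank(Map<String, Double> scores) {
    LangScore[] ranked = new LangScore[scores.size()];
    int i = 0;
    for (Map.Entry<String, Double> entry : scores.entrySet()) {
      ranked[i++] = new LangScore(entry.getKey(), entry.getValue());
    }
    Arrays.sort(ranked, Comparator.comparingDouble((LangScore s) -> s.score).reversed());
    return ranked;
  }
}
```

A caller that only wants the old behaviour can read ranked[0].lang, while callers that need confidence information can inspect the remaining pairs.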
[jira] [Closed] (NUTCH-591) StringIndexOutOfBoundsException when extracting text from a Word document.
[ https://issues.apache.org/jira/browse/NUTCH-591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-591. --- Bulk close of resolved issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira StringIndexOutOfBoundsException when extracting text from a Word document. -- Key: NUTCH-591 URL: https://issues.apache.org/jira/browse/NUTCH-591 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 0.9.0 Environment: linux redhat as4u4 x86 kernel 2.6.9 Reporter: frank ling see http://issues.apache.org/bugzilla/show_bug.cgi?id=41076+ -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Closed] (NUTCH-363) Fetcher normalizes everything at least twice
[ https://issues.apache.org/jira/browse/NUTCH-363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-363. --- Bulk close of resolved issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira Fetcher normalizes everything at least twice Key: NUTCH-363 URL: https://issues.apache.org/jira/browse/NUTCH-363 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 0.8 Environment: OS X 10.4.7 Reporter: Doug Cook Priority: Minor Fix For: 2.0 New links are normalized twice by the fetcher: First in DOMContentUtils.getOutlinks, where the constructor Outlink(url.toString(), linkText.toString().trim(), conf) normalizes the URL. The second time is in ParseOutputFormat.write(). For some URLs (e.g. those repeated on a page) a given URL may be normalized a number of times, but it is always normalized at least twice. For those of us with expensive normalizations, this is probably burning some CPU. I'd gladly fix this, but I'm not yet familiar enough with the code to know if there are some hidden assumptions which rely on this behavior. [A related note is that URLs are normalized *before* filtering; this is causing a lot of extra normalization as well. In general, filters may not be safe to run before normalization, but there is likely a class of them which are (filtering out .gif/.jpg etc). Perhaps the notion of a pre-normalizer filter would be a useful one?] -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Closed] (NUTCH-185) XMLParser is configurable xml parser plugin.
[ https://issues.apache.org/jira/browse/NUTCH-185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma closed NUTCH-185.
---

Bulk close of resolved issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

XMLParser is configurable xml parser plugin.
Key: NUTCH-185
URL: https://issues.apache.org/jira/browse/NUTCH-185
Project: Nutch
Issue Type: New Feature
Components: fetcher, indexer
Affects Versions: 0.7.2, 0.8, 0.8.1
Environment: OS Independent
Reporter: Rida Benjelloun
Assignee: Chris A. Mattmann
Fix For: 1.1
Attachments: parse-xml.patch, parse-xml.zip, parse-xml.zip

The XML parser is a configurable plugin. It uses XPath and namespaces to do the mapping between XML elements and Lucene fields.

Information:

1- Copy xmlparser-conf.xml to the nutch/conf dir.

2- To index your custom XML file, you have to modify xmlparser-conf.xml. This parser uses namespaces and XPath to parse XML content. The config file does the mapping between the XML nodes (using XPath) and Lucene fields.
Example:
<field name="dctitle" xpath="//dc:title" type="Text" boost="1.4" />

3- The xmlIndexerProperties element encapsulates a set of fields associated with a namespace. If the namespace is found in the XML document, the fields represented by the namespace will be indexed.
Example:
<xmlIndexerProperties type="filePerDocument" namespace="http://purl.org/dc/elements/1.1/">
  <field name="dctitle" xpath="//dc:title" type="Text" boost="1.4" />
  <field name="dccreator" xpath="//dc:creator" type="keyword" boost="1.0" />
</xmlIndexerProperties>

4- It is possible to define a default namespace that will be applied when the parser doesn't find any namespace in the document, or when the namespace found in the XML document doesn't match the namespace defined in the xmlIndexerProperties.
Example:
<xmlIndexerProperties type="filePerDocument" namespace="default">
  <field name="xmlcontent" xpath="//*" type="Unstored" boost="1.0" />
</xmlIndexerProperties>
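To show what a mapping such as xpath=//dc:title under the Dublin Core namespace amounts to, here is a small Java sketch built on the standard javax.xml.xpath API. It is not code from the parse-xml attachments: XPathFieldSketch and extract are hypothetical names, and the sketch handles a single prefix/namespace pair and returns the text of the first matching node only.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Hypothetical sketch, not the plugin's code: evaluates one namespaced XPath expression. */
final class XPathFieldSketch {
  /** Returns the string value of the first node matching expression, or "" if none matches. */
  static String extract(String xml, final String prefix, final String namespaceUri,
      String expression) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true); // required for prefixed XPath steps to match
      Document doc = factory.newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));

      XPath xpath = XPathFactory.newInstance().newXPath();
      // Bind the single prefix used in the expression to its namespace URI.
      xpath.setNamespaceContext(new NamespaceContext() {
        @Override public String getNamespaceURI(String p) {
          return prefix.equals(p) ? namespaceUri : XMLConstants.NULL_NS_URI;
        }
        @Override public String getPrefix(String uri) {
          return namespaceUri.equals(uri) ? prefix : null;
        }
        @Override public Iterator<String> getPrefixes(String uri) {
          return namespaceUri.equals(uri)
              ? Collections.singletonList(prefix).iterator()
              : Collections.<String>emptyIterator();
        }
      });
      return xpath.evaluate(expression, doc);
    } catch (Exception e) {
      throw new IllegalStateException("Could not evaluate " + expression, e);
    }
  }
}
```

The value returned for //dc:title is what a field such as dctitle would receive; a full implementation would loop over every configured field and also apply the type and boost attributes.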
[jira] [Closed] (NUTCH-310) Review Log Levels
[ https://issues.apache.org/jira/browse/NUTCH-310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma closed NUTCH-310.
---

Bulk close of resolved issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

Review Log Levels
-
Key: NUTCH-310
URL: https://issues.apache.org/jira/browse/NUTCH-310
Project: Nutch
Issue Type: Improvement
Affects Versions: 0.8
Reporter: Jerome Charron
Assignee: Chris A. Mattmann
Priority: Minor
Fix For: 2.0

Review of log content and log levels (see Commons Logging Best Practices: http://jakarta.apache.org/commons/logging/guide.html#Message_Priorities_Levels)
[jira] [Closed] (NUTCH-659) Help! No urls fetched for internal repository website
[ https://issues.apache.org/jira/browse/NUTCH-659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma closed NUTCH-659.
---

Bulk close of resolved issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

Help! No urls fetched for internal repository website
-
Key: NUTCH-659
URL: https://issues.apache.org/jira/browse/NUTCH-659
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 0.9.0
Environment: nutch 0.9, TOMCAT6.0.18, JAVA 1.6.0_10, CentOS 5.2
Reporter: Bryan
Priority: Critical

I am new to Nutch and have set it up to search my company's internal websites. The version is nutch-2008-11-02_04-01-26.tar.

My internal company websites include several HTTP websites. Another one is an SVN repository served over HTTPS, in an XML structure using dir and file tags.

Search over the HTTP websites is good. The HTTPS is OK. We have some links in those HTTP websites which point to Word files under the SVN website. They can be indexed.

However, Nutch does not search my SVN website. If I point it only at the SVN website, the result is always: 0 urls fetched.

My nutch-site.xml is as follows:

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|msexcel|msword|mspowerpoint|pdf|zip|swf|rss)|index-(basic|anchor)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

# skip file:, ftp:, mailto: urls
-^(ftp|mailto):
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*smartlabs.com.au/

Any help would be much appreciated. Thanks in advance.
[jira] [Closed] (NUTCH-774) Retry interval in crawl date is set to 0
[ https://issues.apache.org/jira/browse/NUTCH-774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma closed NUTCH-774.
---

Bulk close of resolved issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

Retry interval in crawl date is set to 0
Key: NUTCH-774
URL: https://issues.apache.org/jira/browse/NUTCH-774
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 1.0.0
Reporter: Reinhard Schwab
Assignee: Chris A. Mattmann
Fix For: 1.2, 2.0
Attachments: NUTCH-774.patch, NUTCH-774_2.patch

When I fetch and parse a feed with the feed plugin,
http://www.wachauclimbing.net/home/impressum-disclaimer/feed/
another crawl date is generated:
http://www.wachauclimbing.net/home/impressum-disclaimer/comment-page-1/

After fetching a second round, the dump of the crawl db still shows a retry interval with value 0.

http://www.wachauclimbing.net/home/impressum-disclaimer/comment-page-1/
Version: 7
Status: 2 (db_fetched)
Fetch time: Wed Dec 02 12:48:22 CET 2009
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 0 seconds (0 days)
Score: 1.084
Signature: db9ab2193924cd2d0b53113a500ca604
Metadata: _pst_: success(1), lastModified=0

A check should be done in DefaultFetchSchedule (or AbstractFetchSchedule) in the method setFetchSchedule.
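The check suggested in NUTCH-774 could take a shape like the Java sketch below: when the stored interval is zero or negative, fall back to a default before computing the next fetch time. This is not the attached patch and does not use the real setFetchSchedule signature; FetchIntervalGuard, its methods, and the way the default interval is supplied are assumptions for illustration.

```java
/** Hypothetical sketch, not the NUTCH-774 patch: guards against a zero retry interval. */
final class FetchIntervalGuard {
  /** Returns the stored interval in seconds, or the default when it is not positive. */
  static int effectiveInterval(int storedIntervalSeconds, int defaultIntervalSeconds) {
    return storedIntervalSeconds > 0 ? storedIntervalSeconds : defaultIntervalSeconds;
  }

  /** Next fetch time in milliseconds: last fetch time plus the effective interval. */
  static long nextFetchTime(long fetchTimeMillis, int storedIntervalSeconds,
      int defaultIntervalSeconds) {
    int interval = effectiveInterval(storedIntervalSeconds, defaultIntervalSeconds);
    return fetchTimeMillis + 1000L * interval; // long arithmetic avoids int overflow
  }
}
```

With such a guard, an entry like the one in the dump above (Retry interval: 0 seconds) would be rescheduled using the default interval instead of becoming due again immediately.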