[jira] [Updated] (NUTCH-897) Subcollection requires blacklist element

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-897:


Attachment: NUTCH-897.patch

Attached a tested fix; it is confirmed to work and does not break existing 
configurations. The patch works for 1.3 and trunk.

 Subcollection requires blacklist element
 

 Key: NUTCH-897
 URL: https://issues.apache.org/jira/browse/NUTCH-897
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.2, 1.3, 2.0
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Trivial
 Fix For: 1.3, 2.0

 Attachments: NUTCH-897.patch


 This is a very minor issue in Subcollection.java. It throws an error if 
 the (empty) blacklist element is omitted. I think it should either not 
 fail at all when the blacklist element is omitted, or throw a decent error 
 message stating that the blacklist element is required. The following 
 exception is thrown if the blacklist element is omitted in a subcollection block:
 2010-09-06 13:32:30,438 INFO  collection.CollectionManager - Instantiating CollectionManager
 2010-09-06 13:32:30,438 INFO  collection.CollectionManager - initializing CollectionManager
 2010-09-06 13:32:30,451 INFO  collection.CollectionManager - file has1 elements
 2010-09-06 13:32:30,456 WARN  collection.CollectionManager - Error occured:java.lang.NullPointerException
 2010-09-06 13:32:30,469 WARN  collection.CollectionManager - java.lang.NullPointerException
 2010-09-06 13:32:30,470 WARN  collection.CollectionManager - at org.apache.nutch.collection.Subcollection.initialize(Subcollection.java:173)
 2010-09-06 13:32:30,470 WARN  collection.CollectionManager - at org.apache.nutch.collection.CollectionManager.parse(CollectionManager.java:98)
 2010-09-06 13:32:30,470 WARN  collection.CollectionManager - at org.apache.nutch.collection.CollectionManager.init(CollectionManager.java:75)
 2010-09-06 13:32:30,470 WARN  collection.CollectionManager - at org.apache.nutch.collection.CollectionManager.init(CollectionManager.java:56)
 2010-09-06 13:32:30,471 WARN  collection.CollectionManager - at org.apache.nutch.collection.CollectionManager.getCollectionManager(CollectionManager.java:115)
 2010-09-06 13:32:30,471 WARN  collection.CollectionManager - at org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter.addSubCollectionField(SubcollectionIndexingFilter.java:65)
 2010-09-06 13:32:30,471 WARN  collection.CollectionManager - at org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter.filter(SubcollectionIndexingFilter.java:71)
 2010-09-06 13:32:30,471 WARN  collection.CollectionManager - at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:109)
 2010-09-06 13:32:30,471 WARN  collection.CollectionManager - at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:134)
 2010-09-06 13:32:30,472 WARN  collection.CollectionManager - at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50)
 2010-09-06 13:32:30,472 WARN  collection.CollectionManager - at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
 2010-09-06 13:32:30,472 WARN  collection.CollectionManager - at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
 2010-09-06 13:32:30,472 WARN  collection.CollectionManager - at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
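
The behaviour requested in this issue (tolerate an omitted blacklist element, or fail with a clear message) can be illustrated with a small sketch. This is not the Nutch code or the attached patch; it is a Python illustration, and the element names `name`, `whitelist` and `blacklist` as well as the helper `read_subcollection` are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

def read_subcollection(xml_text):
    """Parse one <subcollection> block; whitelist and blacklist are optional."""
    node = ET.fromstring(xml_text)

    def text_of(tag):
        child = node.find(tag)
        # A missing element and an empty element both yield an empty string,
        # so no null reference is ever dereferenced.
        if child is None or child.text is None:
            return ""
        return child.text.strip()

    name = text_of("name")
    if not name:
        # Required elements fail with a readable message instead of a bare error.
        raise ValueError("subcollection is missing the required <name> element")
    return {
        "name": name,
        "whitelist": text_of("whitelist").split(),
        "blacklist": text_of("blacklist").split(),
    }

config = ("<subcollection><name>demo</name>"
          "<whitelist>http://example.org/</whitelist></subcollection>")
print(read_subcollection(config))
```

The same idea in Java would be a null check on the looked-up element before reading its content.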

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: Clean up open legacy issues in Jira

2011-04-01 Thread Mattmann, Chris A (388J)
Super +1 Markus -- I've tried over the past 9 months to do this periodically 
when I've rolled releases, but if everyone could take a look and close out 
really old or non-applicable bugs, that would be great!

BTW, time is freeing up for me lately, so it might be time finally for the 1.3 
release, if folks are cool with me RM'ing it :)

Cheers,
Chris

On Apr 1, 2011, at 7:03 AM, Markus Jelsma wrote:

 Hi guys,
 
 There's an awful lot of legacy in Jira. I propose  we close the bulk of the 
 issues that deal with the old search server, very old plugins or really old 
 code. Thoughts?
 
 Cheers,


++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++



[jira] [Closed] (NUTCH-973) Remove Segment Merger in 1.3

2011-04-01 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed NUTCH-973.
---

Resolution: Not A Problem

You are right, let's leave it for now. It won't be a problem once we're on 2.0 
anyway

 Remove Segment Merger in 1.3
 

 Key: NUTCH-973
 URL: https://issues.apache.org/jira/browse/NUTCH-973
 Project: Nutch
  Issue Type: Task
Reporter: Julien Nioche
Priority: Minor
 Fix For: 1.3


 The code for segment merging is still in 1.3. As far as I understand its 
 original function, it was mostly useful for having a single data structure 
 from which the search app could get the cached data. Now that we've delegated 
 the indexing and search to SOLR we don't really need to worry about the cache 
 anymore. Would it make sense to purge it, or do you guys think it would still 
 be useful? 



[jira] [Closed] (NUTCH-39) pagination in search result

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-39?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-39.
--

Resolution: Won't Fix

Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 pagination in search result
 ---

 Key: NUTCH-39
 URL: https://issues.apache.org/jira/browse/NUTCH-39
 Project: Nutch
  Issue Type: Improvement
  Components: web gui
 Environment: all
Reporter: Jack Tang
Priority: Trivial

 Currently in Nutch's search.jsp, users navigate through all search results 
 using the Next button. Google-like pagination would feel better.



[jira] [Closed] (NUTCH-36) Chinese in Nutch

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-36?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-36.
--

Resolution: Won't Fix

Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 Chinese in Nutch
 

 Key: NUTCH-36
 URL: https://issues.apache.org/jira/browse/NUTCH-36
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, searcher
 Environment: all
Reporter: Jack Tang
Priority: Minor
 Attachments: #26700


 Nutch currently supports Chinese in a very simple way: NutchAnalysis segments 
 CJK terms word by word. 
 So, if I search for the Chinese term 'FooBar' (two Chinese words: 'Foo' and 
 'Bar'), the result in the web gui will highlight 'FooBar' as well as 'Foo' and 
 'Bar', while we expect Nutch to highlight only 'FooBar'.



[jira] [Closed] (NUTCH-13) If dns points to 127.0.0.1, the url is also crawled

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-13?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-13.
--

Resolution: Won't Fix

Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 If dns points to 127.0.0.1, the url is also crawled
 ---

 Key: NUTCH-13
 URL: https://issues.apache.org/jira/browse/NUTCH-13
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Reporter: Matthias Jaekle
Priority: Minor

 For example, www.tik24.de points to 127.0.0.1.
 If you follow a link to www.tik24.de, the fetcher will crawl content from your 
 own machine.
 Wrong DNS entries could create unwanted entries in segments.
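
One possible guard against this, sketched in Python rather than taken from Nutch: after resolving a host name, check whether the resulting address is a loopback address before fetching. The function names and the `allow_local` switch are hypothetical.

```python
import ipaddress

def is_loopback(ip_text):
    """Return True for loopback addresses such as 127.0.0.1 or ::1."""
    return ipaddress.ip_address(ip_text).is_loopback

def should_fetch(resolved_ip, allow_local=False):
    # Hypothetical policy hook: skip hosts whose DNS entry points back
    # at the crawling machine unless local crawling is explicitly allowed.
    return allow_local or not is_loopback(resolved_ip)

print(should_fetch("127.0.0.1"))
print(should_fetch("8.8.8.8"))
```

The switch matters because crawling an intranet or a local test server is a legitimate use, so such a check would probably need to be configurable.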



[jira] [Closed] (NUTCH-79) Fault tolerant searching.

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-79?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-79.
--

Resolution: Won't Fix

Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 Fault tolerant searching.
 -

 Key: NUTCH-79
 URL: https://issues.apache.org/jira/browse/NUTCH-79
 Project: Nutch
  Issue Type: New Feature
  Components: searcher
Reporter: Piotr Kosiorowski
 Attachments: patch


 I have finally managed to prepare the first version of the fault tolerant 
 searching I promised a long time ago. 
 It reads the server configuration from the search-groups.txt file (in the 
 startup directory or the directory specified by searcher.dir) if no 
 search-servers.txt file is present. If search-servers.txt is present, it is 
 read and handled as before.
 ---
 Format of search-groups.txt:
   search.group.count=[int]
   search.group.name.[i]=[string] (for i=0 to count-1)

   For each name:
   [name].part.count=[int] partitionCount
   [name].part.[i].host=[string] (for i=0 to partitionCount-1)
   [name].part.[i].port=[int] (for i=0 to partitionCount-1)

   Example:
   search.group.count=2
   search.group.name.0=master
   search.group.name.1=backup

   master.part.count=2
   master.part.0.host=host1
   master.part.0.port=
   master.part.1.host=host2
   master.part.1.port=

   backup.part.count=2
   backup.part.0.host=host3
   backup.part.0.port=
   backup.part.1.host=host4
   backup.part.1.port=
 
 If more than one search group is defined in the configuration file, requests 
 are distributed among the groups in round-robin fashion. If one of the servers 
 in a group fails to respond, the whole group is treated as inactive and removed 
 from the pool used to distribute requests. A separate recovery thread checks 
 every searcher.recovery.delay seconds (default 60) whether an inactive group 
 has come back to life and, if so, adds it back to the pool of active 
 groups.
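
The file format and the round-robin and recovery behaviour described above can be sketched as follows. This is a Python illustration, not the attached patch; the class `GroupPool`, the function `parse_groups` and the port numbers 9001 and 9002 are made up for the example (the ports in the description are left blank).

```python
def parse_groups(text):
    """Parse search-groups.txt style 'key=value' lines into {group: [(host, port), ...]}."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if line and "=" in line:
            key, value = line.split("=", 1)
            props[key.strip()] = value.strip()
    groups = {}
    for i in range(int(props["search.group.count"])):
        name = props[f"search.group.name.{i}"]
        count = int(props[f"{name}.part.count"])
        groups[name] = [
            (props[f"{name}.part.{p}.host"], int(props[f"{name}.part.{p}.port"]))
            for p in range(count)
        ]
    return groups

class GroupPool:
    """Round-robin over active groups; a failed group is parked until it recovers."""

    def __init__(self, names):
        self.active = list(names)
        self.inactive = []
        self._next = 0

    def pick(self):
        if not self.active:
            raise RuntimeError("no active search groups")
        name = self.active[self._next % len(self.active)]
        self._next += 1
        return name

    def mark_failed(self, name):
        # One unresponsive server takes the whole group out of rotation.
        if name in self.active:
            self.active.remove(name)
            self.inactive.append(name)

    def recover(self, is_alive):
        # In the described design a recovery thread would call this periodically.
        for name in list(self.inactive):
            if is_alive(name):
                self.inactive.remove(name)
                self.active.append(name)

sample = """
search.group.count=2
search.group.name.0=master
search.group.name.1=backup
master.part.count=1
master.part.0.host=host1
master.part.0.port=9001
backup.part.count=1
backup.part.0.host=host3
backup.part.0.port=9002
"""
pool = GroupPool(parse_groups(sample))
print([pool.pick() for _ in range(3)])
```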



[jira] [Closed] (NUTCH-103) Vivisimo like treeview and url redirect

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-103.
---

Resolution: Won't Fix

Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 Vivisimo like treeview and url redirect
 ---

 Key: NUTCH-103
 URL: https://issues.apache.org/jira/browse/NUTCH-103
 Project: Nutch
  Issue Type: Improvement
  Components: web gui
Affects Versions: 0.8
 Environment: linux
Reporter: robert benea
Priority: Trivial
 Attachments: clusty.tar


 First, I modified cluster.jsp and now the cluster has a vivisimo look. I used 
 javascript to show the treeview.  Another small change is that I call the 
 cluster recursively twice, so that two levels of clustering are shown.
 Second, I added redirect.jsp in order to log the links that were clicked 
 during search and because of that search.jsp is changed as well.
 The code is not clean, as it all started as an experiment. I hope someone else 
 finds it useful and cleans it up ;-). 
 To install it, just copy the files to where you deployed the nutch.war and it 
 will work auto-magically.
 Regards,
 R.



[jira] [Commented] (NUTCH-18) Windows servers include illegal characters in URLs

2011-04-01 Thread David Escuer (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-18?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13014581#comment-13014581
 ] 

David Escuer commented on NUTCH-18:
---

The person you are trying to reach will be out of the SIMPPLE offices from
March 30 until April 7 (both included).


 Windows servers include illegal characters in URLs
 --

 Key: NUTCH-18
 URL: https://issues.apache.org/jira/browse/NUTCH-18
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Reporter: Stefan Groschupf
Priority: Minor

 Transferred from:
 http://sourceforge.net/tracker/index.php?func=detail&aid=1110243&group_id=59548&atid=491356
 submitted by:
 Ken Meltsner
 While spidering our intranet, I found that IIS may include 
 illegal characters in URLs -- specifically, characters with 
 the high bit set to produce non-English letters. In 
 addition, both Firefox and IE will accept URLs with high-
 bit characters, but Java won't.
 While this may not be Nutch's (or Java's) fault, it would 
 help if high-bit characters (and other illegal characters) 
 in URLs could be escaped (using percent-hex notation) 
 as part of the URL fix-up process, probably right after 
 the hostname lower-case conversion.
 Example document name in Portuguese (with high-bit 
 characters) taken from a longer URL:
 Nota%20tecnica%20-%20Alteração%20de%20escopo.doc
 and with percent-escaped characters:
 Nota%20tecnica%20-%20Altera%e7%e3o%20de%20escopo.doc
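
A minimal sketch of the escaping step proposed here, written in Python and not taken from Nutch. It assumes the high-bit characters are Latin-1 bytes, which matches the %e7%e3 example above; the helper name `escape_high_bit` and the chosen set of safe characters are assumptions.

```python
from urllib.parse import quote

def escape_high_bit(url_part, encoding="latin-1"):
    """Percent-escape non-ASCII characters, leaving existing %XX escapes intact."""
    # '%' is kept safe so already-escaped sequences such as %20 are not
    # escaped a second time; the other safe characters are URL delimiters.
    return quote(url_part, safe="%/:?&=#+", encoding=encoding)

print(escape_high_bit("Nota%20tecnica%20-%20Alteração%20de%20escopo.doc"))
```

`quote` emits upper-case hex digits (%E7%E3), which is equivalent to the lower-case form in the example. Passing `encoding="utf-8"` would produce multi-byte escapes instead, so the right choice depends on what the server expects.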



[jira] [Closed] (NUTCH-104) Nutch query parser does not support CJK bi-gram segmentation.

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-104.
---

Resolution: Won't Fix

Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 Nutch query parser does not support CJK bi-gram segmentation.
 -

 Key: NUTCH-104
 URL: https://issues.apache.org/jira/browse/NUTCH-104
 Project: Nutch
  Issue Type: Bug
  Components: searcher
Affects Versions: 0.6
 Environment: all
Reporter: Jack Tang
Priority: Minor

 I customized a query filter that uses test as its field. When I try to 
 search for test:(c1)(c2)(c3), the query object generated by 
 NutchAnalysis is wrong. The current result is
  test:(c1)(c2) [DEFAULT](c2)(c3).
 However, the expected result is
  test:(c1)(c2) (c2)(c3). 
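
The expected behaviour can be illustrated with a short Python sketch (not NutchAnalysis itself): split the CJK characters into overlapping bi-grams and keep the user's field on every resulting clause. The function names are hypothetical.

```python
def cjk_bigrams(chars):
    """Overlapping bi-grams: [c1, c2, c3] -> [c1c2, c2c3]."""
    if len(chars) < 2:
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]

def field_clauses(field, chars):
    # Every bi-gram clause keeps the field the user typed, so the second
    # bi-gram does not fall back to the default field.
    return [f"{field}:{gram}" for gram in cjk_bigrams(chars)]

print(field_clauses("test", ["中", "文", "字"]))
```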



[jira] [Closed] (NUTCH-180) Performance problem with widely used keywords

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-180.
---

Resolution: Won't Fix

Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 Performance problem with widely used keywords
 -

 Key: NUTCH-180
 URL: https://issues.apache.org/jira/browse/NUTCH-180
 Project: Nutch
  Issue Type: Wish
  Components: searcher
Reporter: Mike Alulin

 It looks like Nutch is very slow when the search phrase includes a few widely 
 used keywords. For example, "I 1 2 3 4 5 6 7 8 9 0" typed without the quotes 
 into Yahoo, Google, or MSN is processed in less than a second. Nutch, on the 
 other hand, requires much more time for this even on smaller databases. For 
 example, this phrase made objectssearch.com think for more than 1 minute, 
 although their DB is much smaller than the DBs of the big 3. On my test Nutch 
 DB with only 3M pages this phrase took a few seconds to process.
 Unfortunately I do not know much about search algorithms, but it looks like 
 Nutch does have some room to improve its search performance. The current 
 implementation can easily be brought down by a few search requests like this. 
 Just a couple of dozen such requests make my server with 2 Opterons think 
 for a minute or two at 100% CPU utilization.



[jira] [Closed] (NUTCH-581) DistributedSearch does not update search servers added to search-servers.txt on the fly

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-581.
---


 DistributedSearch does not update search servers added to search-servers.txt 
 on the fly
 ---

 Key: NUTCH-581
 URL: https://issues.apache.org/jira/browse/NUTCH-581
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Affects Versions: 0.9.0
Reporter: Rohan Mehta
Priority: Minor
 Fix For: 0.9.0

 Attachments: NUTCH-581-2.patch, UpdateSearch.patch


 DistributedSearch client updates the search servers added to the 
 search-servers.txt file on the fly. 
 This patch updates the search servers on the fly, so the client does not 
 need a restart.



[jira] [Closed] (NUTCH-877) Allow setting of slop values for non-quote phrase queries on query-basic plugin

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-877.
---


 Allow setting of slop values for non-quote phrase queries on query-basic 
 plugin
 ---

 Key: NUTCH-877
 URL: https://issues.apache.org/jira/browse/NUTCH-877
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Affects Versions: 1.2
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.2

 Attachments: NUTCH-877-1-20100809.patch


 Patch adds a configuration variable for setting slop values on phrase 
 queries.  The default slop value, which currently can't be changed through 
 configuration, is Integer.MAX_VALUE.  It produces something like this, which 
 doesn't seem right to me.  If you are searching for a phrase you usually want 
 it within a certain distance:
 2.9141337E-4 = weight(content:my phrase~2147483647 in 1029), product of:
 * 0.07163286 = queryWeight(content:my phrase~2147483647), product of:
   o 9.657982 = idf(content: my=13470 phrase=534)
   o 0.0074169594 = queryNorm
 This patch adds the query.phrase.slop configuration value to the 
 nutch-default.xml file.  It has a default setting of 5.
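
To show why the slop value matters, here is a deliberately simplified Python sketch of a two-term proximity check. It is not Lucene's algorithm (Lucene's sloppy phrase matching also handles longer phrases and reordered terms); it only treats slop as the number of extra positions allowed between the first and the second term.

```python
def within_slop(positions_a, positions_b, slop):
    """True if some occurrence of term b follows term a with at most `slop` positions in between."""
    return any(
        0 <= (pb - pa - 1) <= slop
        for pa in positions_a
        for pb in positions_b
    )

# "my" at position 3, "phrase" at position 10: six positions lie in between.
print(within_slop([3], [10], 5))
print(within_slop([3], [10], 2147483647))
```

With a slop of Integer.MAX_VALUE the two terms match anywhere in the document, which is the behaviour the description calls into question.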



[jira] [Updated] (NUTCH-265) Getting Clustered results in better form.

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-265:



Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 Getting Clustered results in better form.
 -

 Key: NUTCH-265
 URL: https://issues.apache.org/jira/browse/NUTCH-265
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Affects Versions: 0.7.2
Reporter: Kris K

 The cluster results currently come with a title and a link to the URL. As an 
 improvement, they should be clustered by keyword phrases (Vivisimo style). 
 Anyone is welcome to share their views on this. 



[jira] [Updated] (NUTCH-674) NutchBean doesn't check for searcher.dir existance.

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-674:



Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 NutchBean doesn't check for searcher.dir existance.
 ---

 Key: NUTCH-674
 URL: https://issues.apache.org/jira/browse/NUTCH-674
 Project: Nutch
  Issue Type: Bug
  Components: searcher
Affects Versions: 0.9.0
 Environment: Looks like platform independent problem.
Reporter: Kuba Kończyk

 If searcher.dir doesn't exist or is not accessible, the searcher will just 
 continue and report that 0 hits were found. It should throw an exception 
 or log an error instead. As a starting point, a patch to solve this problem 
 was proposed some time ago on nutch-dev: 
 http://www.mail-archive.com/nutch-developers@lists.sourceforge.net/msg09422.html
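
The requested check is small. A Python sketch of the idea follows; it is not the patch linked above, and the function name is hypothetical.

```python
import os

def check_searcher_dir(path):
    """Fail loudly instead of silently returning zero hits."""
    if not os.path.isdir(path):
        raise FileNotFoundError(
            f"searcher.dir does not exist or is not a directory: {path}")
    if not os.access(path, os.R_OK | os.X_OK):
        raise PermissionError(f"searcher.dir is not accessible: {path}")
    return path
```

Calling such a check once at start-up would turn a misconfigured directory into an immediate, visible error.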



[jira] [Updated] (NUTCH-423) Add other index-basic fields as query plugins

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-423:



Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 Add other index-basic fields as query plugins
 -

 Key: NUTCH-423
 URL: https://issues.apache.org/jira/browse/NUTCH-423
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Affects Versions: 0.9.0
Reporter: stack
Priority: Minor
 Attachments: other-index-basic-query-fields.patch


 The basic indexer plugin adds 'host', 'site', 'url', 'content', 'title', and 
 'anchor'.  The query-basic plugin expands queries against the 'default' field 
 to run against all basic indexer plugin fields.  The query-url plugin adds 
 query filtering on the 'url' field and query-site on 'site'.  This patch 
 adds plugins to filter on the remainder: 'host', 'content', 'title', and 
 'anchor'.



[jira] [Updated] (NUTCH-47) Configure host filter to do wildcard prefixes - *.redhat.com

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-47?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-47:
---


Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 Configure host filter to do wildcard prefixes - *.redhat.com
 

 Key: NUTCH-47
 URL: https://issues.apache.org/jira/browse/NUTCH-47
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
 Environment: Linux
Reporter: byron miller
Priority: Minor

 Right now you can configure the max results per host for query response, but 
 that seems limited to exact host matches such as www.redhat.com.
 In many ways it would be nice to include the capability to match hosts by 
 wildcard.
 For example search for redhat on mozdex.com:
 http://www.mozdex.com/search.jsp?query=redhat
 And you will see:
 www.apac.redhat.com 
 www.europe.redhat.com 
 www.in.redhat.com 
 Could this be fixed so that *.redhat.com is under "find more sources" under 
 redhat.com or something like that?
 I may be able to tweak the other processes, but I can envision a problem of 
 people creating www1, www2, www3 or using other country codes for the 
 same/similar content, filling up pages of SERPs in place of what could be 
 other relevant information.
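
A sketch of the grouping this request asks for, in Python and not tied to any Nutch API: map each host to a configured parent domain so that per-host result limits apply to the whole domain. The function name and the list of configured domains are assumptions.

```python
def host_group(host, domains):
    """Map a host to the configured domain it falls under, as '*.redhat.com' would."""
    host = host.lower().rstrip(".")
    for domain in domains:
        # Match the domain itself or any subdomain, but not look-alikes
        # such as 'notredhat.com'.
        if host == domain or host.endswith("." + domain):
            return domain
    return host  # no wildcard configured: fall back to the exact host

hosts = ["www.apac.redhat.com", "www.europe.redhat.com",
         "www.in.redhat.com", "www.example.org"]
print({h: host_group(h, ["redhat.com"]) for h in hosts})
```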



[jira] [Updated] (NUTCH-943) Search Results default dedup field site should be stored in index.

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-943:



Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 Search Results default dedup field site should be stored in index.
 

 Key: NUTCH-943
 URL: https://issues.apache.org/jira/browse/NUTCH-943
 Project: Nutch
  Issue Type: Bug
  Components: indexer, searcher
Affects Versions: 1.2
Reporter: Charan Malemarpuram
 Attachments: NUTCH-943.patch


 The site field is not configured as a stored field in the SOLR schema.
 Search always returns only two results with a See More Hits button, even 
 if the results are from different sites.
 The attached patch changes the default schema.xml config to store the site 
 field.



[jira] [Updated] (NUTCH-469) changes to geoPosition plugin to make it work on nutch 0.9

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-469:



Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 changes to geoPosition plugin to make it work on nutch 0.9
 --

 Key: NUTCH-469
 URL: https://issues.apache.org/jira/browse/NUTCH-469
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, searcher
Affects Versions: 0.9.0
Reporter: Mike Schwartz
 Attachments: NUTCH-469-2007-05-09.txt.gz, geoPosition-0.5.tgz, 
 geoPosition0.6_cdiff.zip


 I have modified the geoPosition plugin 
 (http://wiki.apache.org/nutch/GeoPosition) code to work with nutch 0.9.  (The 
 code was built originally using nutch 0.7.)  I'd like to contribute my 
 changes back to the nutch project.  I already communicated with the code's 
 author (Matthias Jaekle), and he agrees with my mods.



[jira] [Updated] (NUTCH-377) Add possibility to search for multiple values

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-377:



Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 Add possibility to search for multiple values
 -

 Key: NUTCH-377
 URL: https://issues.apache.org/jira/browse/NUTCH-377
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Reporter: Stefan Neufeind

 Searches with boolean operators (AND or OR) are not (yet) possible. All 
 search items are always combined with AND.
 But it would be nice to be able to allow multiple values for a 
 certain field. Maybe that could be done using a separator?
 As an example you might want to search for:
 someword site:www.example.org|www.apache.org
 This (to my understanding) would allow searching for one or more words with a 
 restriction to those two sites. It would avoid having to implement AND and 
 OR fully (maybe even including brackets) but would cover a few often-used 
 cases, imho.
 Easy/hard to do? To my understanding Lucene itself allows AND/OR searches. So 
 it might basically be a matter of string parsing and query building towards 
 Lucene?
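
The string-parsing part of this proposal might look like the following Python sketch. It is an illustration only; the `|` separator comes from the example above, while the function names and the textual output format are assumptions and not Nutch or Lucene API.

```python
def parse_field_values(clause, separator="|"):
    """Split 'site:a|b' into ('site', ['a', 'b'])."""
    field, _, raw = clause.partition(":")
    values = [v for v in raw.split(separator) if v]
    return field, values

def to_boolean_query(clause):
    # Each value becomes its own field clause; the clauses are OR-ed together.
    field, values = parse_field_values(clause)
    return "(" + " OR ".join(f"{field}:{v}" for v in values) + ")"

print(to_boolean_query("site:www.example.org|www.apache.org"))
```

The remaining words of the query would still be combined with AND, so only the multi-valued field needs the OR group.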



[jira] [Updated] (NUTCH-453) Move stop words to a config file

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-453:



Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 Move stop words to a config file
 

 Key: NUTCH-453
 URL: https://issues.apache.org/jira/browse/NUTCH-453
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, searcher
Reporter: Steve Severance
Priority: Minor

 Move the stop words from the code to a config file. This will allow the stop 
 words to be modified without recompiling the code. The format could be the 
 same as the regex-urlfilter, where regexes are used to define the words, or a 
 plain text file of words could be used. 
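
For the plain-text variant, a loader could be as small as the Python sketch below. The one-word-per-line layout and the `#` comment syntax are assumptions for illustration, not an existing Nutch file format.

```python
def load_stop_words(lines):
    """Read one stop word per line; blank lines and '#' comments are ignored."""
    words = set()
    for line in lines:
        word = line.split("#", 1)[0].strip().lower()
        if word:
            words.add(word)
    return words

sample = ["# common English stop words", "the", "and  # conjunction", "", "Of"]
stop_words = load_stop_words(sample)
print(sorted(stop_words))
```

Because the function accepts any iterable of lines, an open file object can be passed to it directly.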



[jira] [Updated] (NUTCH-542) Null Pointer Exception on getSummary when segment no longer exists

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-542:



Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 Null Pointer Exception on getSummary when segment no longer exists
 --

 Key: NUTCH-542
 URL: https://issues.apache.org/jira/browse/NUTCH-542
 Project: Nutch
  Issue Type: Bug
  Components: searcher
Affects Versions: 0.9.0
 Environment: ubuntu, tomcat5.5
Reporter: Jeff V.
Priority: Minor

 If the index refers to a search result in a given segment, but that segment 
 directory does not exist (has been deleted for some reason) the search.jsp 
 will return a completely blank page because a Null Pointer Exception is being 
 thrown from getSummary. At the very least it would be nice to get a more 
 friendly log message such as "segment doesn't exist". Ideally, though, the 
 search should continue and simply omit the non-existent results.
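
The suggested behaviour (log a friendly message and omit the affected results) can be sketched in Python as follows. This is not the search.jsp or getSummary code; the hit structure with `segment` and `url` keys and the function name are hypothetical.

```python
import logging
import os

log = logging.getLogger("search")

def drop_missing_segments(hits, segments_dir):
    """Keep only hits whose segment directory still exists; log the rest."""
    kept = []
    for hit in hits:
        if os.path.isdir(os.path.join(segments_dir, hit["segment"])):
            kept.append(hit)
        else:
            log.warning("segment %s doesn't exist, skipping hit %s",
                        hit["segment"], hit["url"])
    return kept
```

Summaries would then be requested only for the hits that remain, so a deleted segment costs a few results instead of the whole page.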



[jira] [Updated] (NUTCH-466) Flexible segment format

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-466:



Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 Flexible segment format
 ---

 Key: NUTCH-466
 URL: https://issues.apache.org/jira/browse/NUTCH-466
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Affects Versions: 1.0.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: ParseFilters.java, segmentparts.patch


 In many situations it is necessary to store more data associated with pages 
 than is possible with the current segment format. Quite often this is 
 binary data. There are two common workarounds for this: one is to use 
 per-page metadata, either in Content or ParseData; the other is to use an 
 external independent database using page ID-s as foreign keys.
 Currently segments can consist of the following predefined parts: content, 
 crawl_fetch, crawl_generate, crawl_parse, parse_text and parse_data. I 
 propose a third option, which is a natural extension of this existing segment 
 format, i.e. to introduce the ability to add arbitrarily named segment 
 parts, with the only requirement that they should be MapFile-s that store 
 Writable keys and values. Alternatively, we could define a 
 SegmentPart.Writer/Reader to accommodate even more sophisticated scenarios.
 Existing segment API and searcher API (NutchBean, DistributedSearch 
 Client/Server) should be extended to handle such arbitrary parts.
 Example applications:
 * storing HTML previews of non-HTML pages, such as PDF, PS and Office 
 documents
 * storing pre-tokenized version of plain text for faster snippet generation
 * storing linguistically tagged text for sophisticated data mining
 * storing image thumbnails
 etc, etc ...
 I'm going to prepare a patchset shortly. Any comments and suggestions are 
 welcome.



[jira] [Updated] (NUTCH-480) Searching multiple indexes with a single nutch instance

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-480:



Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 Searching multiple indexes with a single nutch instance
 ---

 Key: NUTCH-480
 URL: https://issues.apache.org/jira/browse/NUTCH-480
 Project: Nutch
  Issue Type: Improvement
  Components: searcher, web gui
Affects Versions: 0.8
 Environment: Linux and Windows
Reporter: Ravi Chintakunta
 Attachments: nutch.zip


 Searching across multiple indexes with a single instance of Nutch is a useful 
 feature improvement. I had this requirement for my production site, where we 
 wanted to list the available categories (indexes) as check boxes 
 so that the user could select any combination of indexes to search. The results 
 page also displays the number of hits in each index.
 To do this, I:
 - modified web.xml to include the paths to the various search indexes
 - modified Nutch.java to read all the indexes and create IndexReaders
 - modified IndexSearcher.java to handle multiple IndexReaders
 The attached file contains the patch against the Nutch 0.8 code base and 
 the newly added files:
 - SearchServlet - a servlet that is the web interface for search. This is a 
 simplified version of the JSP pages (without the i18n) and outputs the results 
 in text, XML or JSON format.
 - SearchConstants - an interface for messages and constants
 Please note that the patch includes the functionality for spell check, aka 
 "Did you mean?"



[jira] [Updated] (NUTCH-470) Adding optional terms to a query

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-470:



Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 Adding optional terms to a query
 

 Key: NUTCH-470
 URL: https://issues.apache.org/jira/browse/NUTCH-470
 Project: Nutch
  Issue Type: Wish
  Components: searcher
Affects Versions: 0.9.0
 Environment: Any
Reporter: Trond Andersen
Priority: Minor
 Attachments: optional.patch


 I'm missing an API for adding optional terms in the Query class. I made a small 
 adjustment to the API to support this.



[jira] [Updated] (NUTCH-541) Index url field untokenized

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-541:



Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 Index url field untokenized
 ---

 Key: NUTCH-541
 URL: https://issues.apache.org/jira/browse/NUTCH-541
 Project: Nutch
  Issue Type: New Feature
  Components: indexer, searcher
Affects Versions: 1.0.0
Reporter: Enis Soztutar
Assignee: Enis Soztutar

 The url field is indexed as Store.YES, Index.TOKENIZED. We also need an 
 untokenized version of the url field in some contexts: 
 1. For deleting duplicates by url (at search time); see NUTCH-455.
 2. For restricting the search to a certain url (may be used in the case of 
 RSS search, where each entry in the RSS feed is added as a distinct document with 
 (possibly) the same url). 
 query-url extends FieldQueryFilter, so: 
 Query: url:http://www.apache.org/
 Parsed: url:http http-www http-www-apache www www-apache apache org
 Translated: +url:http-http-www http-www-http-www-apache 
 http-www-apache-www www-www-apache www-apache apache org
 3. For accessing one or more documents in the search servers 
 (using a query plugin).
 I suggest we add url as in index-basic and implement a query-url-untoken 
 plugin. 
 doc.add(new Field("url", url.toString(), Field.Store.YES, 
 Field.Index.TOKENIZED));
 doc.add(new Field("url_untoken", url.toString(), Field.Store.NO, 
 Field.Index.UN_TOKENIZED));



[jira] [Updated] (NUTCH-72) Query basic filter with correction feature

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-72?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-72:
---


Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 Query basic filter with correction feature
 --

 Key: NUTCH-72
 URL: https://issues.apache.org/jira/browse/NUTCH-72
 Project: Nutch
  Issue Type: New Feature
  Components: searcher
 Environment: lucene
Reporter: Christophe Noel
 Attachments: querycorrectionplugin.zip


 This plugin extends the query-basic plugin with a correction feature.
 Lucene includes a FuzzyQuery feature, which searches not only for 
 matching terms but also for very similar terms.
 This plugin should be used instead of query-basic by people looking for an 
 easy way to correct user queries.
 The Correction Query Plugin can be used as follows:
 Solution 1: If you want to search for very similar terms, add 
 autocorrectionmod as the first term of the query (example: 'nutch engine' -> 
 'autocorrectionmod nutch engine').
 Solution 2: Create a new search.jsp page that includes a correction 
 checkbox (<input type="checkbox" name="autocorrection" 
 value="true"> may automatically add 'autocorrectionmod' as the first term of 
 the query). 
 FuzzyQuery has a big problem: it is very slow on large indexes!
 So the Correction Query Plugin works as follows:
 - it is not useful for big indexes
 - it only works for words of 5 characters or more
 - it only looks for words whose first 2 characters match (to improve 
 performance this should be set to 3/4)
 - it only accepts suffixes that match at least 65 % (the algorithm is Levenshtein)
 Please give your opinion about it.
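
 The matching rules listed above can be sketched roughly as follows. This is 
 only an illustration and not the code in querycorrectionplugin.zip: the class 
 and method names are hypothetical, and the similarity measure (1 minus the 
 Levenshtein distance of the suffixes divided by the longer suffix length) is 
 an assumption, since the issue does not say how the 65 % is computed.

```java
// Hypothetical sketch of the matching rules listed above; not the plugin's code.
final class CorrectionMatchSketch {
  static final int MIN_WORD_LENGTH = 5;      // "5 characters and more"
  static final int PREFIX_LENGTH = 2;        // first 2 characters must match
  static final double MIN_SIMILARITY = 0.65; // 65 % suffix similarity (assumed measure)

  /** Classic dynamic-programming Levenshtein edit distance. */
  static int levenshtein(String a, String b) {
    int[] prev = new int[b.length() + 1];
    int[] curr = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) {
      prev[j] = j;
    }
    for (int i = 1; i <= a.length(); i++) {
      curr[0] = i;
      for (int j = 1; j <= b.length(); j++) {
        int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
        curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
      }
      int[] tmp = prev;
      prev = curr;
      curr = tmp;
    }
    return prev[b.length()];
  }

  /** Returns true if indexTerm would be accepted as a correction of queryTerm. */
  static boolean isCorrectionCandidate(String queryTerm, String indexTerm) {
    if (queryTerm.equals(indexTerm)) {
      return true; // exact matches always qualify
    }
    if (queryTerm.length() < MIN_WORD_LENGTH || indexTerm.length() < PREFIX_LENGTH) {
      return false; // short words are never corrected
    }
    if (!queryTerm.regionMatches(0, indexTerm, 0, PREFIX_LENGTH)) {
      return false; // the prefix must match exactly
    }
    String querySuffix = queryTerm.substring(PREFIX_LENGTH);
    String indexSuffix = indexTerm.substring(PREFIX_LENGTH);
    int longest = Math.max(querySuffix.length(), indexSuffix.length());
    double similarity = 1.0 - (double) levenshtein(querySuffix, indexSuffix) / longest;
    return similarity >= MIN_SIMILARITY;
  }

  public static void main(String[] args) {
    System.out.println(isCorrectionCandidate("nutch", "nutsh"));
    System.out.println(isCorrectionCandidate("nutch", "mutch"));
  }
}
```

 Requiring an exact prefix is what keeps the number of candidate terms small; 
 a longer prefix means fewer terms need the more expensive distance computation.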



[jira] [Updated] (NUTCH-260) Three new plugins that parse, index and query meta tags defined in the configuration

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-260:



Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 Three new plugins that parse, index and query meta tags defined in the 
 configuration
 

 Key: NUTCH-260
 URL: https://issues.apache.org/jira/browse/NUTCH-260
 Project: Nutch
  Issue Type: New Feature
  Components: indexer, searcher
Affects Versions: 0.7.2
 Environment: Built and tested on Linux so far.
Reporter: Jake Vanderdray
Priority: Minor
 Attachments: nutch_customizations.tar


 These plugins allow you to define meta tags in your nutch-site file that 
 you want to include in parsing, indexing and searching.  The query plugin 
 must replace query-basic.  The format for adding query terms to 
 nutch-site.xml is:
 <property>
   <name>meta.names</name>
   <value>keywords,recommended</value>
   <description>This is a comma separated list of meta tag names that will
   be parsed, indexed and searched against when parse-meta, index-meta and
   query-meta are used.</description>
 </property>
 <property>
   <name>meta.boosts</name>
   <value>1.0,5.0</value>
   <description>Comma separated list of boost values when searching using
   query-meta.  The order of the values should match the order of meta.names.
   </description>
 </property>
 Meta tags found are assumed to have either a single value or a comma 
 separated list of values.  The values found are added to the index as Lucene 
 keywords (i.e. <meta name="keywords" values="First Thing, Second Thing"> would 
 result in two keyword fields named keywords.  The first would contain 
 "First Thing" and the second would contain "Second Thing").
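
 As an illustration of how the two properties line up, here is a minimal Java 
 sketch that pairs each name in meta.names with the boost at the same position 
 in meta.boosts, and splits a comma separated meta tag value into individual 
 keywords. The class and method names are hypothetical and are not taken from 
 the attached plugins.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; not part of the attached plugins.
final class MetaConfigSketch {

  /** Pairs the i-th name in meta.names with the i-th value in meta.boosts. */
  static Map<String, Float> pairBoosts(String metaNames, String metaBoosts) {
    String[] names = metaNames.split(",");
    String[] boosts = metaBoosts.split(",");
    if (names.length != boosts.length) {
      throw new IllegalArgumentException("meta.names and meta.boosts differ in length");
    }
    Map<String, Float> result = new LinkedHashMap<>();
    for (int i = 0; i < names.length; i++) {
      result.put(names[i].trim(), Float.parseFloat(boosts[i].trim()));
    }
    return result;
  }

  /** Splits a meta tag value on commas, trimming blanks and dropping empty entries. */
  static List<String> splitValues(String metaValue) {
    List<String> values = new ArrayList<>();
    for (String part : metaValue.split(",")) {
      String trimmed = part.trim();
      if (!trimmed.isEmpty()) {
        values.add(trimmed);
      }
    }
    return values;
  }

  public static void main(String[] args) {
    System.out.println(pairBoosts("keywords,recommended", "1.0,5.0"));
    System.out.println(splitValues("First Thing, Second Thing"));
  }
}
```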
 I had to replace the query-basic plugin in order to allow matches in the meta 
 fields to return hits even if there were no matches in any of the default 
 fields.  The query-basic plugin only returns hits when every search term is 
 found in at least one default field.  I needed hits returned if matches were 
 found in at least one field for every term, and/or the entire search phrase 
 appeared in a meta index field.
 One known bug is that common terms are not stripped out of the 
 fields' values before they get indexed, so "The Next Big Thing" could not be 
 matched because the query engine strips "the" from all queries.  I 
 intend to fix this by stripping common terms from meta fields before 
 indexing them.
 Another issue is that searching for "Next Big Thing" would not match meta 
 index values for "Next", "Big" or "Thing".  You can consider that a bug or a 
 feature depending on how you look at it.
 These plugins were written for and only work on the 0.7.2 branch.
 I'm going to attach a tarball of the source of these three plugins after I 
 create the issue.  To use the plugins, you'll need to untar them in your 
 src/plugins directory and add them to the ant build.xml directive (and of 
 course add them in your nutch-site.xml file).  If these end up getting added 
 to the project, I'll write up documentation on the wiki.



[jira] [Updated] (NUTCH-445) Domain İndexing / Query Filter

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-445:



Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 Domain İndexing / Query Filter
 --

 Key: NUTCH-445
 URL: https://issues.apache.org/jira/browse/NUTCH-445
 Project: Nutch
  Issue Type: New Feature
  Components: indexer, searcher
Affects Versions: 0.9.0
Reporter: Enis Soztutar
 Attachments: TranslatingRawFieldQueryFilter_v1.0.patch, 
 index_query_domain_v1.0.patch, index_query_domain_v1.1.patch, 
 index_query_domain_v1.2.patch


 Hostnames contain information about the domain of the host, and all of the 
 subdomains. Indexing and searching the domains is important for intuitive 
 behavior. 
 From the DomainIndexingFilter javadoc: 
 Adds the domain (hostname) and all super domains to the index. 
  * <br> For http://lucene.apache.org/nutch/ the 
  * following will be added to the index: <br> 
  * <ul>
  * <li>lucene.apache.org</li>
  * <li>apache.org</li>
  * <li>org</li>
  * </ul>
  * All hostnames are domain names, but not all domain names are 
  * hostnames. In the above example the hostname lucene is a 
  * subdomain of apache.org, which is itself a subdomain of 
  * org. <br>
  * 
  
 Currently the basic indexing filter indexes the hostname in the site field, and 
 the query-site plugin 
 allows searching in the site field. However, site:apache.org will not return 
 http://lucene.apache.org.
  By indexing the domain, we are able to search domains. Unlike 
  searching the site field (indexed by BasicIndexingFilter), searching the 
  domain field allows us to retrieve lucene.apache.org for the query 
  apache.org. 
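
 The expansion described in the javadoc can be sketched in a few lines of 
 Java. This is a minimal illustration and not the code from the attached 
 patches; the class and method names are hypothetical.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch; not the DomainIndexingFilter from the attached patches.
final class DomainExpanderSketch {

  /** Returns the host of the URL followed by each of its super domains. */
  static List<String> domainsOf(String url) {
    List<String> domains = new ArrayList<>();
    String host = URI.create(url).getHost();
    if (host == null) {
      return domains; // no host component, nothing to index
    }
    host = host.toLowerCase(Locale.ROOT);
    while (!host.isEmpty()) {
      domains.add(host);
      int dot = host.indexOf('.');
      // Drop the leftmost label; stop once no label is left.
      host = dot < 0 ? "" : host.substring(dot + 1);
    }
    return domains;
  }

  public static void main(String[] args) {
    System.out.println(domainsOf("http://lucene.apache.org/nutch/"));
  }
}
```

 Indexing every entry of the returned list in one field is what would let a 
 query for apache.org match a page on lucene.apache.org.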
  



[jira] [Updated] (NUTCH-820) Infinite loop when hitspersite is set

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-820:



Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 Infinite loop when hitspersite is set
 -

 Key: NUTCH-820
 URL: https://issues.apache.org/jira/browse/NUTCH-820
 Project: Nutch
  Issue Type: Bug
  Components: searcher
Affects Versions: 1.0.0
Reporter: Xiao Yang

 NutchBean will re-search over and over when the page number becomes large and 
 the excluded sites exceed MAX_PROHIBITED_TERMS.



[jira] [Updated] (NUTCH-92) DistributedSearch incorrectly scores results

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-92?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-92:
---


Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 DistributedSearch incorrectly scores results
 

 Key: NUTCH-92
 URL: https://issues.apache.org/jira/browse/NUTCH-92
 Project: Nutch
  Issue Type: Bug
  Components: searcher
Affects Versions: 1.1
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: distributed-idf-v2.patch, distributed-idf.patch


 When running search servers in a distributed setup, using 
 DistributedSearch$Server and Client, total scores are incorrectly calculated. 
 The symptoms are that scores differ depending on how segments are deployed to 
 Servers, i.e. if there is uneven distribution of terms in segment indexes 
 (due to segment size or content differences) then scores will differ 
 depending on how many and which segments are deployed on a particular Server. 
 This may lead to prioritizing of non-relevant results over more relevant ones.
 The underlying reason for this is that each IndexSearcher (which uses local 
 index on each Server) calculates scores based on the local IDFs of query 
 terms, and not the global IDFs from all indexes together. This means that 
 scores arriving from different Servers to the Client cannot be meaningfully 
 compared, unless all indexes have similar distribution of Terms and similar 
 numbers of documents in them. However, currently the Client mixes all scores 
 together, sorts them by absolute values and picks top hits. These absolute 
 values will change if segments are un-evenly deployed to Servers.
 Currently the workaround is to deploy the same number of documents in 
 segments per Server, and to ensure that segments contain well-randomized 
 content so that term frequencies for common terms are very similar.
 The solution proposed here (as a result of discussion between ab and cutting, 
 patches are coming) is to calculate global IDFs prior to running the query, 
 and pre-boost query Terms with these global IDFs. This will require one more 
 RPC call per each query (this can be optimized later, e.g. through caching). 
 Then the scores will become normalized according to the global IDFs, and 
 Client will be able to meaningfully compare them. Scores will also become 
 independent of the segment content or local number of documents per Server. 
 This will involve at least the following changes:
 * change NutchSimilarity.idf(Term, Searcher) to always return 1.0f. This 
 enables us to manipulate scores independently of local IDFs.
 * add a new method to Searcher interface, int[] getDocFreqs(Term[]), which 
 will return document frequencies for query terms.
 * modify getSegmentNames() so that it returns also the total number of 
 documents in each segment, or implement this as a separate method (this will 
 be called once during segment init)
 * in DistributedSearch$Client.search() first make a call to servers to return 
 local IDFs for the current query, and calculate global IDFs for each relevant 
 Term in that query.
 * multiply the TermQuery boosts by idf(totalDocFreq, totalIndexedDocs), and 
 PhraseQuery boosts by the sum of the idf(totalDocFreqs, totalIndexedDocs) for 
 all of its terms
 This solution should be applicable with only minor changes to all branches, 
 but initially the patches will be relative to trunk/ .
 Comments, suggestions and review are welcome!
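
 The global IDF step proposed above could look roughly like the following 
 sketch. It is not taken from the attached patches: the class and method names 
 are hypothetical, the per-server arrays stand in for the results of the 
 proposed getDocFreqs(Term[]) call, and the idf formula 
 (log(numDocs / (docFreq + 1)) + 1) is assumed here, modeled on the classic 
 Lucene similarity.

```java
// Hypothetical sketch of the client-side global IDF computation.
final class GlobalIdfSketch {

  /** Assumed idf formula: log(numDocs / (docFreq + 1)) + 1. */
  static float idf(long docFreq, long numDocs) {
    return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
  }

  /**
   * perServerDocFreqs[s][t] is the document frequency of query term t on server s;
   * perServerNumDocs[s] is the number of indexed documents on server s.
   */
  static float[] globalIdfs(int[][] perServerDocFreqs, int[] perServerNumDocs) {
    int numTerms = perServerDocFreqs.length == 0 ? 0 : perServerDocFreqs[0].length;
    long totalIndexedDocs = 0;
    for (int numDocs : perServerNumDocs) {
      totalIndexedDocs += numDocs;
    }
    float[] idfs = new float[numTerms];
    for (int t = 0; t < numTerms; t++) {
      long totalDocFreq = 0;
      for (int[] serverFreqs : perServerDocFreqs) {
        totalDocFreq += serverFreqs[t];
      }
      // The same value is used as a boost for the term on every server,
      // so scores no longer depend on how segments are distributed.
      idfs[t] = idf(totalDocFreq, totalIndexedDocs);
    }
    return idfs;
  }

  public static void main(String[] args) {
    // Two servers with very different index sizes; two query terms.
    float[] idfs = globalIdfs(new int[][] {{9, 0}, {90, 0}}, new int[] {100, 900});
    System.out.println(idfs[0] + " " + idfs[1]);
  }
}
```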



[jira] [Updated] (NUTCH-573) Multiple Domains - Query Search

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-573:



Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 Multiple Domains - Query Search
 ---

 Key: NUTCH-573
 URL: https://issues.apache.org/jira/browse/NUTCH-573
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Affects Versions: 0.9.0
 Environment: All
Reporter: Rajasekar Karthik
Assignee: Enis Soztutar
 Attachments: multiTermQuery_v1.patch


 Searching multiple domains can be done in Lucene, but not that efficiently 
 in Nutch.
 Query:
 +content:abc +(site:www.aaa.com site:www.bbb.com)
 works in Lucene, but the same concept does not work in Nutch.
 In Lucene, it works with 
 org.apache.lucene.analysis.KeywordAnalyzer
 org.apache.lucene.analysis.standard.StandardAnalyzer 
 but NOT with
 org.apache.lucene.analysis.SimpleAnalyzer 
 Is the Nutch analyzer based on SimpleAnalyzer? If so, is there a 
 workaround to make this work? Is there an option to change which analyzer 
 Nutch uses? 
 Just FYI, another solution (inefficient, I believe) that seems to work 
 in Nutch:
 query -site:ccc.com -site:ddd.com 



[jira] [Updated] (NUTCH-764) Add support for vfsfile:// loading of plugins for JBoss

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-764:



Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 Add support for vfsfile:// loading of plugins for JBoss
 ---

 Key: NUTCH-764
 URL: https://issues.apache.org/jira/browse/NUTCH-764
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Affects Versions: 1.0.0
 Environment: JBoss AS 5.1.0
Reporter: tcur...@approachingpi.com
Priority: Trivial

 In the file:
 /src/java/org/apache/nutch/plugin/PluginManifestParser.java
 there is a check to make sure that the plugin file location is a URL 
 formatted like file://path/plugins.
 When deployed on JBoss, the file protocol will sometimes be 
 vfsfile://path/plugins.  The code can operate the same way with vfsfile, so I 
 propose changing the check to also allow this protocol.  This would allow 
 Nutch to be deployed on the newer versions of JBoss without any modification.
 Here is a simple patch:
 Index: src/java/org/apache/nutch/plugin/PluginManifestParser.java
 ===
 --- src/java/org/apache/nutch/plugin/PluginManifestParser.java  Mon Nov 09 20:20:51 EST 2009
 +++ src/java/org/apache/nutch/plugin/PluginManifestParser.java  Mon Nov 09 20:20:51 EST 2009
 @@ -121,7 +121,8 @@
      } else if (url == null) {
        LOG.warn("Plugins: directory not found: " + name);
        return null;
 -    } else if (!"file".equals(url.getProtocol())) {
 +    } else if (!"file".equals(url.getProtocol())
 +        && !"vfsfile".equals(url.getProtocol())) {
        LOG.warn("Plugins: not a file: url. Can't load plugins from: " + url);
        return null;
      }
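
 The relaxed check can also be expressed as a small standalone helper, shown 
 here only as a sketch. The class name and the example paths are hypothetical. 
 The sketch reads the scheme from java.net.URI rather than java.net.URL, on 
 the assumption that a plain JVM outside JBoss may have no URL handler 
 registered for vfsfile.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper illustrating the relaxed check; not code from Nutch.
final class PluginLocationCheckSketch {
  private static final Set<String> FILE_LIKE_PROTOCOLS =
      new HashSet<>(Arrays.asList("file", "vfsfile"));

  /** Returns true if plugins may be loaded from a location with this protocol. */
  static boolean isFileLike(String protocol) {
    return protocol != null && FILE_LIKE_PROTOCOLS.contains(protocol);
  }

  public static void main(String[] args) {
    String[] locations = {
      "file:/opt/nutch/plugins",     // hypothetical path
      "vfsfile:/opt/jboss/plugins",  // hypothetical path
      "http://example.org/plugins"
    };
    for (String location : locations) {
      String scheme = URI.create(location).getScheme();
      System.out.println(location + " -> " + isFileLike(scheme));
    }
  }
}
```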



[jira] [Updated] (NUTCH-455) dedup on tokenized fields is faulty

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-455:



Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 dedup on tokenized fields is faulty
 ---

 Key: NUTCH-455
 URL: https://issues.apache.org/jira/browse/NUTCH-455
 Project: Nutch
  Issue Type: Bug
  Components: searcher
Affects Versions: 0.9.0
Reporter: Enis Soztutar
 Attachments: IndexSearcherCacheWarm.patch


 (From LUCENE-252) 
 Nutch uses several index servers, and the search results from these servers 
 are merged using a dedup field for deleting duplicates. The values of 
 this field are cached by Lucene's FieldCacheImpl. The default is the site 
 field, which is indexed and tokenized. However, for a tokenized field (for 
 example url in Nutch), FieldCacheImpl returns an array of terms rather than an 
 array of field values, so deduplication becomes faulty. The current FieldCache 
 implementation does not respect tokenized fields and, as described above, 
 caches only terms. 
 So when we search using url as the dedup field and 
 a Hit is constructed in IndexSearcher, the dedupValue becomes a token of 
 the url (such as "www" or "com") rather than the whole url. This prevents 
 using tokenized fields as the dedup field. 
 I have written a patch for Lucene and attached it to 
 http://issues.apache.org/jira/browse/LUCENE-252; this patch fixes the 
 aforementioned issue with tokenized field caching. However, building such a 
 cache for about 1.5M documents takes 20+ seconds. The code in 
 IndexSearcher.translateHits() starts with
 if (dedupField != null) 
   dedupValues = FieldCache.DEFAULT.getStrings(reader, dedupField);
 and the cache is built on the first call of search in IndexSearcher. 
 Long story short, I have written a patch against IndexSearcher which 
 warms up the caches of the wanted fields (configurable) in the constructor. I think we 
 should vote for LUCENE-252, and then commit the above patch with the latest 
 version of Lucene.



[jira] [Updated] (NUTCH-708) NutchBean: OOM due to searcher.max.hits and dedup.

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-708:



Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 NutchBean: OOM due to searcher.max.hits and dedup.
 --

 Key: NUTCH-708
 URL: https://issues.apache.org/jira/browse/NUTCH-708
 Project: Nutch
  Issue Type: Bug
  Components: searcher
Affects Versions: 1.0.0
 Environment: Ubuntu Linux, Java 5.
Reporter: Aaron Binns

 When searching an index we built for the National Archives, this one in 
 particular: http://webharvest.gov/collections/congress110th/
 we ran into an interesting situation.
 We were using searcher.max.hits=1000 in order to get faster searches.  Since 
 our index is sorted, the best documents are at the front and setting 
 searcher.max.hits=1000 would give us a nice trade-off of search quality vs. 
 response time.
 What I discovered was that with dedup (on site) enabled, we would get into 
 this loop where the searcher.max.hits would limit the raw hits to 1000 and 
 the deduplication code would get to the end of those 1000 results and still 
 need more as it hadn't found enough de-dup'd results to satisfy the query.
 The first 6 pages of results would be fine, but when we got to page 7, the 
 NutchBean would need more than 1000 raw results in order to get 60 de-duped 
 results.
 The code:
 for (int rawHitNum = 0; rawHitNum < hits.getTotal(); rawHitNum++) {
   // get the next raw hit
   if (rawHitNum >= hits.getLength()) {
     // optimize query by prohibiting more matches on some excluded values
     Query optQuery = (Query)query.clone();
     for (int i = 0; i < excludedValues.size(); i++) {
       if (i == MAX_PROHIBITED_TERMS)
         break;
       optQuery.addProhibitedTerm(((String)excludedValues.get(i)),
                                  dedupField);
     }
     numHitsRaw = (int)(numHitsRaw * rawHitsFactor);
     if (LOG.isInfoEnabled()) {
       LOG.info("re-searching for "+numHitsRaw+" raw hits, query: "+optQuery);
     }
     hits = searcher.search(optQuery, numHitsRaw,
                            dedupField, sortField, reverse);
     if (LOG.isInfoEnabled()) {
       LOG.info("found "+hits.getTotal()+" raw hits");
     }
     rawHitNum = -1;
     continue;
   }
 The loop constraints were never satisfied as rawHitNum and hits.getLength() 
 are capped by searcher.max.hits (1000).  The numHitsRaw keeps increasing by a 
 factor of 2 (rawHitsFactor) until it gets to 2^31 or so and deep down in the 
 search library code an array is allocated using that value as the size and 
 you get an OOM.
 We worked around the problem by abandoning the use of searcher.max.hits.  I 
 suppose we could have increased the value, but the index was small enough 
 (~10GB) that disabling searcher.max.hits didn't degrade the response time too 
 much.
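
 To show how quickly numHitsRaw runs away, here is a small standalone sketch 
 that repeats only the growth expression from the loop above. The class name 
 and the starting value of 20 are hypothetical; the factor of 2 is the one 
 mentioned in the description. In Java the float-to-int cast saturates at 
 Integer.MAX_VALUE instead of wrapping, which matches the "2^31 or so" seen 
 before the OOM.

```java
// Hypothetical sketch; it models only the growth of numHitsRaw, not the search.
final class RawHitsGrowthSketch {

  /** Counts the re-searches needed before numHitsRaw saturates at Integer.MAX_VALUE. */
  static int researchesUntilSaturation(int numHitsRaw, float rawHitsFactor) {
    if (numHitsRaw <= 0 || rawHitsFactor < 2.0f) {
      // Keeps the sketch simple: guarantees that the value grows on every round.
      throw new IllegalArgumentException("need numHitsRaw > 0 and rawHitsFactor >= 2");
    }
    int rounds = 0;
    while (numHitsRaw < Integer.MAX_VALUE) {
      numHitsRaw = (int) (numHitsRaw * rawHitsFactor); // same expression as in the loop above
      rounds++;
    }
    return rounds;
  }

  public static void main(String[] args) {
    System.out.println(researchesUntilSaturation(20, 2.0f));
  }
}
```

 Because the number of hits actually returned stays capped while the requested 
 number keeps growing, a few dozen rounds are enough to reach the maximum int value.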



[jira] [Closed] (NUTCH-72) Query basic filter with correction feature

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-72?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-72.
--

Resolution: Won't Fix

 Query basic filter with correction feature
 --

 Key: NUTCH-72
 URL: https://issues.apache.org/jira/browse/NUTCH-72
 Project: Nutch
  Issue Type: New Feature
  Components: searcher
 Environment: lucene
Reporter: Christophe Noel
 Attachments: querycorrectionplugin.zip




[jira] [Closed] (NUTCH-294) Topic-maps of related searchwords

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-294.
---

Resolution: Won't Fix

 Topic-maps of related searchwords
 -

 Key: NUTCH-294
 URL: https://issues.apache.org/jira/browse/NUTCH-294
 Project: Nutch
  Issue Type: New Feature
  Components: searcher
Reporter: Stefan Neufeind

 Would it be possible to offer users topic-maps? That is, when you search for 
 something, you get topic-related words that might also be of interest to you. 
 I wonder if that's somehow possible with the ngram index for "did you mean" 
 (see the separate feature-enhancement bug for this), but we'd need to have a 
 relation between words (in what context do they occur).
 For the web frontend, trees are usually used, which for some users offer 
 quite impressive eye-candy :-) E.g. see this advertisement by Novell where 
 I've just seen a similar topic-map as well:
 http://www.novell.com/de-de/company/advertising/defineyouropen.html



[jira] [Closed] (NUTCH-943) Search Results default dedup field site should be stored in index.

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-943.
---

Resolution: Won't Fix

 Search Results default dedup field site should be stored in index.
 

 Key: NUTCH-943
 URL: https://issues.apache.org/jira/browse/NUTCH-943
 Project: Nutch
  Issue Type: Bug
  Components: indexer, searcher
Affects Versions: 1.2
Reporter: Charan Malemarpuram
 Attachments: NUTCH-943.patch


 The site field is not configured as a stored field in the Solr schema.
 Search always returns only two results and shows a "See More Hits" button, even 
 if the results are from different sites.
 See More
 Attached patch changes the default schema.xml config to store site field.



[jira] [Closed] (NUTCH-540) some problem about the Nutch cache

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-540.
---

Resolution: Won't Fix

 some problem about the Nutch cache
 --

 Key: NUTCH-540
 URL: https://issues.apache.org/jira/browse/NUTCH-540
 Project: Nutch
  Issue Type: Bug
  Components: searcher
Affects Versions: 0.9.0
 Environment: Red hat AS4 + Tomcat5.5 + Nutch0.9
Reporter: crossany
 Attachments: 1.gif, 1186733525.jpg


 I am Chinese.
 I tested searching for Chinese words in Nutch. I installed Nutch 0.9 in Tomcat 5 on 
 Linux. The Tomcat charset is UTF-8, and I used Nutch to crawl a 
 Chinese website whose charset is also UTF-8. When I use Nutch on 
 Tomcat to search for a Chinese word, the title and 
 description of the search results are displayed correctly. But when I click the cache link, 
 the cached page is displayed with the wrong charset, although the cached 
 page's charset is also UTF-8. I found a website that uses Nutch, 
 http://www.synoo.com:8080/zh/, and tested searching for a Chinese word there. It shows 
 the same error.
 I used Luke to inspect the segments, and it can display the Chinese words, so I think 
 this may be a bug.



[jira] [Closed] (NUTCH-469) changes to geoPosition plugin to make it work on nutch 0.9

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-469.
---

Resolution: Won't Fix

 changes to geoPosition plugin to make it work on nutch 0.9
 --

 Key: NUTCH-469
 URL: https://issues.apache.org/jira/browse/NUTCH-469
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, searcher
Affects Versions: 0.9.0
Reporter: Mike Schwartz
 Attachments: NUTCH-469-2007-05-09.txt.gz, geoPosition-0.5.tgz, 
 geoPosition0.6_cdiff.zip


 I have modified the geoPosition plugin 
 (http://wiki.apache.org/nutch/GeoPosition) code to work with nutch 0.9.  (The 
 code was built originally using nutch 0.7.)  I'd like to contribute my 
 changes back to the nutch project.  I already communicated with the code's 
 author (Matthias Jaekle), and he agrees with my mods.



[jira] [Closed] (NUTCH-92) DistributedSearch incorrectly scores results

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-92?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-92.
--

Resolution: Won't Fix

 DistributedSearch incorrectly scores results
 

 Key: NUTCH-92
 URL: https://issues.apache.org/jira/browse/NUTCH-92
 Project: Nutch
  Issue Type: Bug
  Components: searcher
Affects Versions: 1.1
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: distributed-idf-v2.patch, distributed-idf.patch


 When running search servers in a distributed setup, using 
 DistributedSearch$Server and Client, total scores are incorrectly calculated. 
 The symptoms are that scores differ depending on how segments are deployed to 
 Servers, i.e. if there is uneven distribution of terms in segment indexes 
 (due to segment size or content differences) then scores will differ 
 depending on how many and which segments are deployed on a particular Server. 
 This may lead to prioritizing of non-relevant results over more relevant ones.
 The underlying reason for this is that each IndexSearcher (which uses local 
 index on each Server) calculates scores based on the local IDFs of query 
 terms, and not the global IDFs from all indexes together. This means that 
 scores arriving from different Servers to the Client cannot be meaningfully 
 compared, unless all indexes have similar distribution of Terms and similar 
 numbers of documents in them. However, currently the Client mixes all scores 
 together, sorts them by absolute values and picks top hits. These absolute 
 values will change if segments are un-evenly deployed to Servers.
 Currently the workaround is to deploy the same number of documents in 
 segments per Server, and to ensure that segments contain well-randomized 
 content so that term frequencies for common terms are very similar.
 The solution proposed here (as a result of discussion between ab and cutting, 
 patches are coming) is to calculate global IDFs prior to running the query, 
 and pre-boost query Terms with these global IDFs. This will require one more 
 RPC call per each query (this can be optimized later, e.g. through caching). 
 Then the scores will become normalized according to the global IDFs, and 
 Client will be able to meaningfully compare them. Scores will also become 
 independent of the segment content or local number of documents per Server. 
 This will involve at least the following changes:
 * change NutchSimilarity.idf(Term, Searcher) to always return 1.0f. This 
 enables us to manipulate scores independently of local IDFs.
 * add a new method to Searcher interface, int[] getDocFreqs(Term[]), which 
 will return document frequencies for query terms.
 * modify getSegmentNames() so that it returns also the total number of 
 documents in each segment, or implement this as a separate method (this will 
 be called once during segment init)
 * in DistributedSearch$Client.search() first make a call to servers to return 
 local IDFs for the current query, and calculate global IDFs for each relevant 
 Term in that query.
 * multiply the TermQuery boosts by idf(totalDocFreq, totalIndexedDocs), and 
 PhraseQuery boosts by the sum of the idf(totalDocFreqs, totalIndexedDocs) for 
 all of its terms
 This solution should be applicable with only minor changes to all branches, 
 but initially the patches will be relative to trunk/ .
 Comments, suggestions and review are welcome!
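 The global IDF step proposed above can be sketched roughly as follows. This 
 is only an illustration, not the attached patch: the class and method names 
 are hypothetical, and it assumes the classic Lucene idf formula 
 log(numDocs / (docFreq + 1)) + 1. Document frequencies and document counts 
 reported by each Server are summed before the idf is computed, so the result 
 no longer depends on how segments are spread across Servers.

```java
/** Hypothetical sketch: combine per-server statistics into one global idf. */
final class GlobalIdfSketch {

  /** Classic Lucene-style idf (assumed here): log(numDocs / (docFreq + 1)) + 1. */
  static float idf(long docFreq, long numDocs) {
    return (float) (Math.log((double) numDocs / (double) (docFreq + 1)) + 1.0);
  }

  /** Adds up the values reported by each server. */
  static long sum(int[] perServer) {
    long total = 0;
    for (int value : perServer) {
      total += value;
    }
    return total;
  }

  /**
   * Global idf for one term.
   * docFreqPerServer[i] is the term's document frequency on server i,
   * docsPerServer[i] is the number of indexed documents on server i.
   */
  static float globalIdf(int[] docFreqPerServer, int[] docsPerServer) {
    return idf(sum(docFreqPerServer), sum(docsPerServer));
  }
}
```

 With 200 documents in total and a term that occurs in 100 of them, 
 globalIdf gives the same value whether the matches are split 1/99 or 50/50 
 between two Servers, whereas the local idf values for the 1/99 split differ 
 widely.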



[jira] [Closed] (NUTCH-674) NutchBean doesn't check for searcher.dir existance.

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-674.
---

Resolution: Won't Fix

 NutchBean doesn't check for searcher.dir existance.
 ---

 Key: NUTCH-674
 URL: https://issues.apache.org/jira/browse/NUTCH-674
 Project: Nutch
  Issue Type: Bug
  Components: searcher
Affects Versions: 0.9.0
 Environment: Looks like platform independent problem.
Reporter: Kuba Kończyk

 If searcher.dir doesn't exist or isn't accessible, the searcher just 
 continues and reports that 0 hits were found. It should throw an exception 
 or log an error instead. As a starting point, a patch to solve this problem 
 was proposed some time ago on nutch-dev: 
 http://www.mail-archive.com/nutch-developers@lists.sourceforge.net/msg09422.html
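 A minimal sketch of the kind of check being asked for, using only the 
 standard java.nio.file API. This is not the patch referenced above and not 
 NutchBean's actual code; the class and method names are hypothetical, and 
 the value is assumed to be a local path.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical sketch: fail loudly when searcher.dir is missing or unreadable. */
final class SearcherDirCheck {

  /** Returns the directory as a Path, or throws with a descriptive message. */
  static Path requireReadableDir(String searcherDir) throws IOException {
    Path dir = Paths.get(searcherDir);
    if (!Files.isDirectory(dir)) {
      throw new IOException("searcher.dir does not exist or is not a directory: "
          + dir.toAbsolutePath());
    }
    if (!Files.isReadable(dir)) {
      throw new IOException("searcher.dir is not readable: " + dir.toAbsolutePath());
    }
    return dir;
  }
}
```

 A caller could invoke this once at start-up, so a bad path surfaces as an 
 error instead of an empty result list.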



[jira] [Closed] (NUTCH-820) Infinite loop when hitspersite is set

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-820.
---

Resolution: Won't Fix

 Infinite loop when hitspersite is set
 -

 Key: NUTCH-820
 URL: https://issues.apache.org/jira/browse/NUTCH-820
 Project: Nutch
  Issue Type: Bug
  Components: searcher
Affects Versions: 1.0.0
Reporter: Xiao Yang

 NutchBean will re-search over and over when the page number becomes large 
 and the number of excluded sites exceeds MAX_PROHIBITED_TERMS.



[jira] [Closed] (NUTCH-708) NutchBean: OOM due to searcher.max.hits and dedup.

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-708.
---

Resolution: Won't Fix

 NutchBean: OOM due to searcher.max.hits and dedup.
 --

 Key: NUTCH-708
 URL: https://issues.apache.org/jira/browse/NUTCH-708
 Project: Nutch
  Issue Type: Bug
  Components: searcher
Affects Versions: 1.0.0
 Environment: Ubuntu Linux, Java 5.
Reporter: Aaron Binns

 When searching an index we built for the National Archives, this one in 
 particular: http://webharvest.gov/collections/congress110th/
 We ran into an interesting situation.
 We were using searcher.max.hits=1000 in order to get faster searches.  Since 
 our index is sorted, the best documents are at the front and setting 
 searcher.max.hits=1000 would give us a nice trade-off of search quality vs. 
 response time.
 What I discovered was that with dedup (on site) enabled, we would get into 
 this loop where the searcher.max.hits would limit the raw hits to 1000 and 
 the deduplication code would get to the end of those 1000 results and still 
 need more as it hadn't found enough de-dup'd results to satisfy the query.
 The first 6 pages of results would be fine, but when we got to page 7, the 
 NutchBean would need more than 1000 raw results in order to get 60 de-duped 
 results.
 The code:
 for (int rawHitNum = 0; rawHitNum < hits.getTotal(); rawHitNum++) {
   // get the next raw hit
   if (rawHitNum >= hits.getLength()) {
     // optimize query by prohibiting more matches on some excluded values
     Query optQuery = (Query)query.clone();
     for (int i = 0; i < excludedValues.size(); i++) {
       if (i == MAX_PROHIBITED_TERMS)
         break;
       optQuery.addProhibitedTerm(((String)excludedValues.get(i)),
                                  dedupField);
     }
     numHitsRaw = (int)(numHitsRaw * rawHitsFactor);
     if (LOG.isInfoEnabled()) {
       LOG.info("re-searching for "+numHitsRaw+" raw hits, query: "+optQuery);
     }
     hits = searcher.search(optQuery, numHitsRaw,
                            dedupField, sortField, reverse);
     if (LOG.isInfoEnabled()) {
       LOG.info("found "+hits.getTotal()+" raw hits");
     }
     rawHitNum = -1;
     continue;
   }
 The loop constraints were never satisfied as rawHitNum and hits.getLength() 
 are capped by searcher.max.hits (1000).  The numHitsRaw keeps increasing by a 
 factor of 2 (rawHitsFactor) until it gets to 2^31 or so and deep down in the 
 search library code an array is allocated using that value as the size and 
 you get an OOM.
 We worked around the problem by abandoning the use of searcher.max.hits.  I 
 suppose we could have increased the value, but the index was small enough 
 (~10GB) that disabling searcher.max.hits didn't degrade the response time too 
 much.
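 One way to make such a loop terminate is to stop re-searching once a larger 
 request returns no additional raw hits, and to grow the request size in a 
 long so it cannot overflow. The sketch below is a simplified simulation, not 
 NutchBean code: the index is a plain list of site names, search() is a 
 hypothetical stand-in for a searcher capped by searcher.max.hits, and dedup 
 keeps one hit per site.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical simulation of a dedup loop that respects a raw-hit cap. */
final class CappedDedupSketch {

  /** Stand-in searcher: returns at most min(numHitsRaw, maxHits) raw hits. */
  static List<String> search(List<String> index, int numHitsRaw, int maxHits) {
    int n = Math.min(Math.min(numHitsRaw, maxHits), index.size());
    return index.subList(0, n);
  }

  /** Collects up to 'wanted' distinct sites without looping forever. */
  static List<String> dedupBySite(List<String> index, int wanted, int maxHits) {
    int numHitsRaw = wanted;
    List<String> hits = search(index, numHitsRaw, maxHits);
    while (true) {
      Set<String> unique = new LinkedHashSet<>(hits); // keeps first-seen order
      if (unique.size() >= wanted) {
        return new ArrayList<>(unique).subList(0, wanted);
      }
      long grown = 2L * numHitsRaw;            // grow in a long: no int overflow
      if (grown > Integer.MAX_VALUE) {
        return new ArrayList<>(unique);
      }
      List<String> more = search(index, (int) grown, maxHits);
      if (more.size() <= hits.size()) {        // cap reached: asking for more is futile
        return new ArrayList<>(unique);
      }
      numHitsRaw = (int) grown;
      hits = more;
    }
  }
}
```

 When the cap prevents the request from being satisfied, the method returns 
 the partial de-duplicated list instead of doubling the request size 
 indefinitely.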



[jira] [Closed] (NUTCH-445) Domain İndexing / Query Filter

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-445.
---

Resolution: Won't Fix

 Domain İndexing / Query Filter
 --

 Key: NUTCH-445
 URL: https://issues.apache.org/jira/browse/NUTCH-445
 Project: Nutch
  Issue Type: New Feature
  Components: indexer, searcher
Affects Versions: 0.9.0
Reporter: Enis Soztutar
 Attachments: TranslatingRawFieldQueryFilter_v1.0.patch, 
 index_query_domain_v1.0.patch, index_query_domain_v1.1.patch, 
 index_query_domain_v1.2.patch


 Hostnames contain information about the domain of the host and all of its 
 subdomains. Indexing and searching the domains is important for intuitive 
 behavior. 
 From DomainIndexingFilter javadoc : 
 Adds the domain(hostname) and all super domains to the index. 
  * <br> For http://lucene.apache.org/nutch/ the 
  * following will be added to the index : <br> 
  * <ul>
  * <li>lucene.apache.org </li>
  * <li>apache.org </li>
  * <li>org </li>
  * </ul>
  * All hostnames are domain names, but not all the domain names are 
  * hostnames. In the above example hostname lucene is a 
  * subdomain of apache.org, which is itself a subdomain of 
  * org <br>
  * 
  
 Currently the basic indexing filter indexes the hostname in the site field, 
 and the query-site plugin allows searching in the site field. However, 
 site:apache.org will not return http://lucene.apache.org.
 By indexing the domain, we can search by domain. Unlike searching the site 
 field (indexed by BasicIndexingFilter), searching the domain field lets the 
 query apache.org retrieve lucene.apache.org.
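 The expansion of a hostname into itself plus all of its super domains can be 
 sketched as below. This is an illustration only, not the code from the 
 attached patches; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: list a hostname and all of its super domains. */
final class SuperDomains {

  /** For "lucene.apache.org" returns [lucene.apache.org, apache.org, org]. */
  static List<String> of(String host) {
    List<String> domains = new ArrayList<>();
    String rest = host;
    while (!rest.isEmpty()) {
      domains.add(rest);
      int dot = rest.indexOf('.');
      if (dot < 0) {
        break;                          // reached the top-level domain
      }
      rest = rest.substring(dot + 1);   // drop the leftmost label
    }
    return domains;
  }
}
```

 Each returned value would be added to the domain field, so a query for 
 apache.org matches documents from lucene.apache.org.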
  



[jira] [Closed] (NUTCH-541) Index url field untokenized

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-541.
---

Resolution: Won't Fix

 Index url field untokenized
 ---

 Key: NUTCH-541
 URL: https://issues.apache.org/jira/browse/NUTCH-541
 Project: Nutch
  Issue Type: New Feature
  Components: indexer, searcher
Affects Versions: 1.0.0
Reporter: Enis Soztutar
Assignee: Enis Soztutar

 The url field is indexed as Store.YES, Index.TOKENIZED. We also need the 
 untokenized version of the url field in some contexts: 
 1. For deleting duplicates by url (at search time). See NUTCH-455.
 2. For restricting the search to a certain url (may be used in the case of 
 RSS search, where each entry in the RSS feed is added as a distinct document 
 with (possibly) the same url). 
 query-url extends FieldQueryFilter, so: 
 Query: url:http://www.apache.org/
 Parsed: url:http http-www http-www-apache www www-apache apache org
 Translated: +url:http-http-www http-www-http-www-apache 
 http-www-apache-www www-www-apache www-apache apache org
 3. For accessing documents in the search servers (using a query plugin).
 I suggest we add url in index-basic as follows and implement a 
 query-url-untoken plugin: 
 doc.add(new Field("url", url.toString(), Field.Store.YES, 
 Field.Index.TOKENIZED));
 doc.add(new Field("url_untoken", url.toString(), Field.Store.NO, 
 Field.Index.UN_TOKENIZED));



[jira] [Closed] (NUTCH-455) dedup on tokenized fields is faulty

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-455.
---

Resolution: Won't Fix

 dedup on tokenized fields is faulty
 ---

 Key: NUTCH-455
 URL: https://issues.apache.org/jira/browse/NUTCH-455
 Project: Nutch
  Issue Type: Bug
  Components: searcher
Affects Versions: 0.9.0
Reporter: Enis Soztutar
 Attachments: IndexSearcherCacheWarm.patch


 (From LUCENE-252) 
 Nutch uses several index servers, and the search results from these servers 
 are merged using a dedup field for deleting duplicates. The values of this 
 field are cached by Lucene's FieldCacheImpl. The default is the site field, 
 which is indexed and tokenized. However, for a tokenized field (for example 
 url in Nutch), FieldCacheImpl returns an array of terms rather than an array 
 of field values, so dedup'ing becomes faulty. The current FieldCache 
 implementation does not respect tokenized fields and, as described above, 
 caches only terms. 
 So when we search using url as the dedup field and a Hit is constructed in 
 IndexSearcher, the dedupValue becomes a token of the url (such as "www" or 
 "com") rather than the whole url. This prevents using tokenized fields as 
 the dedup field. 
 I have written a patch for Lucene and attached it to 
 http://issues.apache.org/jira/browse/LUCENE-252; this patch fixes the 
 aforementioned issue with tokenized field caching. However, building such a 
 cache for about 1.5M documents takes 20+ secs. The code in 
 IndexSearcher.translateHits() starts with
 if (dedupField != null) 
   dedupValues = FieldCache.DEFAULT.getStrings(reader, dedupField);
 and on the first call of search in IndexSearcher, the cache is built. 
 Long story short, I have written a patch against IndexSearcher which warms 
 up the caches of the wanted fields (configurable) in the constructor. I 
 think we should vote for LUCENE-252, and then commit the above patch with 
 the latest version of Lucene.



[jira] [Closed] (NUTCH-638) Launching Distributed Searchers with URI indicating filesystem to use rather than relying on hadoop config files.

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-638.
---

Resolution: Won't Fix

 Launching Distributed Searchers with URI indicating filesystem to use rather 
 than relying on hadoop config files.
 -

 Key: NUTCH-638
 URL: https://issues.apache.org/jira/browse/NUTCH-638
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Affects Versions: 1.0.0
Reporter: Aaron Nall
Priority: Minor
 Attachments: distributed-search-uri.patch

   Original Estimate: 0.25h
  Remaining Estimate: 0.25h

 I wanted to conduct all index creation operations in hdfs but search from the 
 local file system using the same cluster of machines.  I believe that this is 
 a common use case.  
 This required either a parallel nutch install or edits (scripted or manual) 
 to hadoop-site.xml to change the file system from hdfs to local when starting 
 a distributed searcher service.  This minor patch makes IndexSearcher and 
 NutchBean honor URIs as supported by hadoop 0.17 without altering existing 
 functionality if simple paths are entered.



[jira] [Closed] (NUTCH-479) Support for OR queries

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-479.
---

Resolution: Won't Fix

 Support for OR queries
 --

 Key: NUTCH-479
 URL: https://issues.apache.org/jira/browse/NUTCH-479
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Affects Versions: 1.0.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: nutch_0.9_OR.patch, or.patch, or.patch


 There have been many requests from users to extend Nutch query syntax to add 
 support for OR queries, in addition to the implicit AND and NOT queries 
 supported now.



[jira] [Closed] (NUTCH-466) Flexible segment format

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-466.
---

Resolution: Won't Fix

 Flexible segment format
 ---

 Key: NUTCH-466
 URL: https://issues.apache.org/jira/browse/NUTCH-466
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Affects Versions: 1.0.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: ParseFilters.java, segmentparts.patch


 In many situations it is necessary to store more data associated with pages 
 than is possible now with the current segment format. Quite often this is 
 binary data. There are two common workarounds for this: one is to use 
 per-page metadata, either in Content or ParseData; the other is to use an 
 external independent database using page IDs as foreign keys.
 Currently segments can consist of the following predefined parts: content, 
 crawl_fetch, crawl_generate, crawl_parse, parse_text and parse_data. I 
 propose a third option, which is a natural extension of this existing segment 
 format, i.e. to introduce the ability to add arbitrarily named segment 
 parts, with the only requirement that they should be MapFile-s that store 
 Writable keys and values. Alternatively, we could define a 
 SegmentPart.Writer/Reader to accommodate even more sophisticated scenarios.
 Existing segment API and searcher API (NutchBean, DistributedSearch 
 Client/Server) should be extended to handle such arbitrary parts.
 Example applications:
 * storing HTML previews of non-HTML pages, such as PDF, PS and Office 
 documents
 * storing pre-tokenized version of plain text for faster snippet generation
 * storing linguistically tagged text for sophisticated data mining
 * storing image thumbnails
 etc, etc ...
 I'm going to prepare a patchset shortly. Any comments and suggestions are 
 welcome.



[jira] [Closed] (NUTCH-377) Add possibility to search for multiple values

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-377.
---

Resolution: Won't Fix

 Add possibility to search for multiple values
 -

 Key: NUTCH-377
 URL: https://issues.apache.org/jira/browse/NUTCH-377
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Reporter: Stefan Neufeind

 Searches with boolean operators (AND or OR) are not (yet) possible. All 
 search items are always combined with AND.
 But it would be nice to be able to allow multiple values for a certain 
 field. Maybe that could be done using a separator?
 As an example you might want to search for:
 someword site:www.example.org|www.apache.org
 Which (to my understanding) would allow searching for one or more words 
 with a restriction to those two sites. It would avoid having to implement 
 AND and OR fully (maybe even including brackets) but would cover a few 
 often-used cases, imho.
 Easy/hard to do? To my understanding Lucene itself allows AND/OR searches. 
 So it might basically be a problem of string parsing and query building 
 towards Lucene?
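 The string-parsing half of this idea could look roughly like the sketch 
 below. It only splits a clause such as site:www.example.org|www.apache.org 
 into a field name and its alternative values; building the actual Lucene 
 query is left out. The class name and the choice of "|" as separator are 
 assumptions taken from the example above, not existing Nutch syntax.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: parse "field:value1|value2" into its parts. */
final class MultiValueClause {
  final String field;
  final List<String> values;

  private MultiValueClause(String field, List<String> values) {
    this.field = field;
    this.values = values;
  }

  /** Splits on the first ':' and then on every '|'; empty values are skipped. */
  static MultiValueClause parse(String clause) {
    int colon = clause.indexOf(':');
    if (colon <= 0) {
      throw new IllegalArgumentException("expected field:value[|value...]: " + clause);
    }
    String field = clause.substring(0, colon);
    List<String> values = new ArrayList<>();
    for (String value : clause.substring(colon + 1).split("\\|")) {
      if (!value.isEmpty()) {
        values.add(value);
      }
    }
    return new MultiValueClause(field, values);
  }
}
```

 The resulting values would then be OR-ed together inside a single required 
 clause for that field.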



[jira] [Closed] (NUTCH-386) Plugin to index categories by url rules

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-386.
---

Resolution: Won't Fix

 Plugin to index categories by url rules
 ---

 Key: NUTCH-386
 URL: https://issues.apache.org/jira/browse/NUTCH-386
 Project: Nutch
  Issue Type: New Feature
  Components: indexer, searcher
Reporter: Ernesto De Santis
Priority: Minor
 Attachments: index-url-category-0.1.zip, index-url-category.jar


 The compressed zip has an install_notes.txt file with instructions.



[jira] [Closed] (NUTCH-453) Move stop words to a config file

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-453.
---

Resolution: Won't Fix

 Move stop words to a config file
 

 Key: NUTCH-453
 URL: https://issues.apache.org/jira/browse/NUTCH-453
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, searcher
Reporter: Steve Severance
Priority: Minor

 Move the stop words from the code to a config file. This will allow the stop 
 words to be modified without recompiling the code. The format could be the 
 same as regex-urlfilter, where regexes are used to define the words, or a 
 plain text file of words could be used. 
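 For the plain text variant, a loader could be as small as the sketch below. 
 The file format (one word per line, blank lines and lines starting with "#" 
 ignored, words lower-cased) and the class name are assumptions made for 
 illustration, not an existing Nutch format.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch: read stop words from a plain text config file. */
final class StopWordsFile {

  /** One word per line; blank lines and '#' comment lines are skipped. */
  static Set<String> load(Path file) throws IOException {
    Set<String> words = new HashSet<>();
    for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
      String word = line.trim();
      if (word.isEmpty() || word.startsWith("#")) {
        continue;
      }
      words.add(word.toLowerCase(Locale.ROOT));
    }
    return words;
  }
}
```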



[jira] [Closed] (NUTCH-260) Three new plugins that parse, index and query meta tags defined in the configuration

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-260.
---

Resolution: Won't Fix

 Three new plugins that parse, index and query meta tags defined in the 
 configuration
 

 Key: NUTCH-260
 URL: https://issues.apache.org/jira/browse/NUTCH-260
 Project: Nutch
  Issue Type: New Feature
  Components: indexer, searcher
Affects Versions: 0.7.2
 Environment: Built and tested on Linux so far.
Reporter: Jake Vanderdray
Priority: Minor
 Attachments: nutch_customizations.tar


 These plugins allow you to define meta tags in your nutch-site file that 
 you want to include in parsing, indexing and searching.  The query plugin 
 must replace query-basic.  The format for adding query terms to 
 nutch-site.xml is:
 <property>
   <name>meta.names</name>
   <value>keywords,recommended</value>
   <description>This is a comma separated list of meta tag names that will
   be parsed, indexed and searched against when parse-meta, index-meta and
   query-meta are used.</description>
 </property>
 <property>
   <name>meta.boosts</name>
   <value>1.0,5.0</value>
   <description>Comma separated list of boost values when searching using
   query-meta.  The order of the values should match the order of meta.names.
   </description>
 </property>
 Meta tags found are assumed to have either a single value or a comma 
 separated list of values.  The values found are added to the index as Lucene 
 keywords (i.e. <meta name="keywords" values="First Thing, Second Thing"> 
 would result in two keyword fields named keywords.  The first would contain 
 "First Thing" and the second would contain "Second Thing").
 I had to replace the query-basic plugin in order to allow matches in the meta 
 fields to return hits even if there were no matches in any of the default 
 fields.  The query-basic plugin only returns hits when every search term is 
 found in at least one default field.  I needed hits returned if matches were 
 found in at least one field for every term, and/or the entire search phrase 
 appeared in a meta index field.
 One known bug is that common terms are not getting stripped out of the 
 fields' values before they get indexed, so "The Next Big Thing" could not be 
 matched because the query engine will strip out "the" from all queries.  I 
 intend to fix this by stripping out common terms from meta fields before 
 indexing them.
 Another issue is that searching for "Next Big Thing" would not match meta 
 index values for "Next", "Big" or "Thing".  You can consider that a bug or a 
 feature depending on how you look at it.
 These plugins were written for and only work on the 0.7.2 branch.
 I'm going to attach a tarball of the source of these three plugins after I 
 create the issue.  To use the plugins, you'll need to untar them in your 
 src/plugins directory and add them to the ant build.xml directive (and of 
 course add them in your nutch-site.xml file).  If these end up getting added 
 to the project, I'll write up documentation on the wiki.
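 Since the order of meta.boosts has to match the order of meta.names, the two 
 comma separated values can be paired up as in the sketch below. This is only 
 an illustration of that pairing, not the code in the attached tarball; the 
 class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: pair meta.names entries with their meta.boosts values. */
final class MetaBoosts {

  /** pair("keywords,recommended", "1.0,5.0") maps keywords to 1.0 and recommended to 5.0. */
  static Map<String, Float> pair(String names, String boosts) {
    String[] nameList = names.split(",");
    String[] boostList = boosts.split(",");
    if (nameList.length != boostList.length) {
      throw new IllegalArgumentException(
          "meta.names has " + nameList.length + " entries but meta.boosts has "
              + boostList.length);
    }
    Map<String, Float> result = new LinkedHashMap<>(); // keeps configured order
    for (int i = 0; i < nameList.length; i++) {
      result.put(nameList[i].trim(), Float.parseFloat(boostList[i].trim()));
    }
    return result;
  }
}
```

 Rejecting lists of different lengths up front avoids silently applying a 
 boost to the wrong meta tag.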



[jira] [Closed] (NUTCH-542) Null Pointer Exception on getSummary when segment no longer exists

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-542.
---

Resolution: Won't Fix

 Null Pointer Exception on getSummary when segment no longer exists
 --

 Key: NUTCH-542
 URL: https://issues.apache.org/jira/browse/NUTCH-542
 Project: Nutch
  Issue Type: Bug
  Components: searcher
Affects Versions: 0.9.0
 Environment: ubuntu, tomcat5.5
Reporter: Jeff V.
Priority: Minor

 If the index refers to a search result in a given segment, but that segment 
 directory does not exist (has been deleted for some reason), search.jsp 
 will return a completely blank page because a Null Pointer Exception is 
 thrown from getSummary. At the very least it would be nice to get a more 
 friendly log message such as "segment doesn't exist". Ideally, though, the 
 search should continue and simply omit the non-existent results.



[jira] [Closed] (NUTCH-355) The title of query result could like the summary have the highlight??

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-355.
---

Resolution: Won't Fix

 The title of query result  could like the summary have the highlight??
 --

 Key: NUTCH-355
 URL: https://issues.apache.org/jira/browse/NUTCH-355
 Project: Nutch
  Issue Type: Wish
  Components: searcher
Affects Versions: 0.8, 1.0.0
 Environment: all
Reporter: King Kong
Priority: Minor

 I'd like to highlight the title, but I can't find out how to do it.
 When I query Nutch, the result should look like this:
 <a href="http://lucene.apache.org/nutch/"> Welcome to <b>Nutch</b>! </a>
 This is the first <b>Nutch</b> release as an Apache Lucene sub-project. See 
 CHANGES.txt for details. The release is available here. ... <b>Nutch</b> has 
 now graduated from the Apache incubator, and is now a Subproject of Lucene. 
 ...



[jira] [Closed] (NUTCH-470) Adding optional terms to a query

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-470.
---

Resolution: Won't Fix

 Adding optional terms to a query
 

 Key: NUTCH-470
 URL: https://issues.apache.org/jira/browse/NUTCH-470
 Project: Nutch
  Issue Type: Wish
  Components: searcher
Affects Versions: 0.9.0
 Environment: Any
Reporter: Trond Andersen
Priority: Minor
 Attachments: optional.patch


 I'm missing an API to add optional terms in the Query class. I made a small 
 adjustment to the API to support this.



[jira] [Closed] (NUTCH-764) Add support for vfsfile:// loading of plugins for JBoss

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-764.
---

Resolution: Won't Fix

 Add support for vfsfile:// loading of plugins for JBoss
 ---

 Key: NUTCH-764
 URL: https://issues.apache.org/jira/browse/NUTCH-764
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Affects Versions: 1.0.0
 Environment: JBoss AS 5.1.0
Reporter: tcur...@approachingpi.com
Priority: Trivial

 In the file:
 /src/java/org/apache/nutch/plugin/PluginManifestParser.java
 There is a check to make sure that the plugin file location is a url 
 formatted like file://path/plugins.
 When deployed on Jboss, the file protocol will sometimes be: 
 vfsfile://path/plugins.  The code with vfsfile can operate the same so I 
 propose a change to the check to also allow this protocol.  This would allow 
 Nutch to be deployed on the newer versions of JBoss without any modification.
 Here is a simple patch:
 Index: src/java/org/apache/nutch/plugin/PluginManifestParser.java
 ===
 --- src/java/org/apache/nutch/plugin/PluginManifestParser.java  Mon Nov 09 20:20:51 EST 2009
 +++ src/java/org/apache/nutch/plugin/PluginManifestParser.java  Mon Nov 09 20:20:51 EST 2009
 @@ -121,7 +121,8 @@
        } else if (url == null) {
          LOG.warn("Plugins: directory not found: " + name);
          return null;
 -      } else if (!"file".equals(url.getProtocol())) {
 +      } else if (!"file".equals(url.getProtocol()) &&
 +          !"vfsfile".equals(url.getProtocol())) {
          LOG.warn("Plugins: not a file: url. Can't load plugins from: " + url);
          return null;
        }
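 The same check can be expressed against a small set of accepted protocols, 
 which keeps the condition readable if more file-like schemes turn up later. 
 This is a standalone sketch, not PluginManifestParser code: the class name 
 is hypothetical, and it uses java.net.URI because the java.net.URL 
 constructor rejects protocols such as vfsfile when no handler is installed 
 for them.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch: accept plugin locations with file-like schemes only. */
final class PluginLocationCheck {
  private static final Set<String> FILE_LIKE =
      new HashSet<>(Arrays.asList("file", "vfsfile"));

  /** True when the location's scheme is one of the accepted file-like schemes. */
  static boolean isLoadable(URI location) {
    String scheme = location.getScheme();
    return scheme != null && FILE_LIKE.contains(scheme);
  }
}
```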



[jira] [Closed] (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-290.
---

Resolution: Won't Fix

 parse-pdf: Garbage indexed when text-extraction not allowed
 ---

 Key: NUTCH-290
 URL: https://issues.apache.org/jira/browse/NUTCH-290
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 0.8
Reporter: Stefan Neufeind
 Attachments: NUTCH-290-canExtractContent.patch


 It seems that garbage (or undecoded text?) is indexed when text-extraction 
 for a PDF is not allowed.
 Example-PDF:
 http://www.task-switch.nl/Dutch/articles/Management_en_Architectuur_v3.pdf

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Closed] (NUTCH-358) Language Switching PROBLEM FIXED

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-358.
---

Resolution: Won't Fix

 Language Switching PROBLEM FIXED
 

 Key: NUTCH-358
 URL: https://issues.apache.org/jira/browse/NUTCH-358
 Project: Nutch
  Issue Type: Bug
  Components: web gui
Affects Versions: 0.8
 Environment: Linux ubuntu 6.0.6
 jakarta-tomcat-5.0.28
 nutch-0.8
Reporter: David Podunavac
Priority: Trivial

 Language selection at the bottom of the page does not affect the result page.
 If the browser language is set to, e.g., en, the result page (search.jsp) 
 is displayed in the browser's language (EN), no matter which language has 
 been selected via the locale links at the bottom of the page.
 request.getParameter=lang appears to be useless, as far as I can see, 
 so the links at the bottom of the page do not translate the result page's 
 keywords.
 This looks like a bug, which is why I am reporting it.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Closed] (NUTCH-389) a url tokenizer implementation for tokenizing index fields : url and host

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-389.
---

Resolution: Won't Fix

 a url tokenizer implementation for tokenizing index fields : url and host
 -

 Key: NUTCH-389
 URL: https://issues.apache.org/jira/browse/NUTCH-389
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 0.9.0
Reporter: Enis Soztutar
Priority: Minor
 Attachments: urlTokenizer-improved.diff, urlTokenizer.diff


 NutchAnalysis.jj tokenizes the input by treating  and _ as non-token 
 separators, which is not appropriate for URLs. So I have 
 written a URL tokenizer which emits the tokens that match the regular 
 expression [a-zA-Z0-9]. As stated in 
 http://www.gbiv.com/protocols/uri/rfc/rfc3986.html, 
 which describes the grammar for URIs, URLs can be tokenized with the above 
 expression. 
 The NutchDocumentAnalyzer code is modified to use the UrlTokenizer with the 
 url, site and host fields.
 see : http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06247.html
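
The idea of splitting a URL into alphanumeric tokens can be sketched with java.util.regex. This is not the attached UrlTokenizer. The class name UrlTokenizerSketch is hypothetical, and the pattern adds a + quantifier to the character class from the description (an assumption) so that each run of letters and digits becomes one token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; the attached patch is not reproduced here.
final class UrlTokenizerSketch {
    // One token per maximal run of ASCII letters and digits.
    private static final Pattern TOKEN = Pattern.compile("[a-zA-Z0-9]+");

    static List<String> tokenize(String url) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(url);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Characters such as '_', '-', '.', '/' and ':' all act as separators here.
        System.out.println(tokenize("http://lucene.apache.org/nutch/my_page-1.html"));
    }
}
```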

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Closed] (NUTCH-396) mergesegs sorts URLs, making segments useless for subsequent fetch

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-396.
---

Resolution: Won't Fix

 mergesegs sorts URLs, making segments useless for subsequent fetch
 --

 Key: NUTCH-396
 URL: https://issues.apache.org/jira/browse/NUTCH-396
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: 0.8
 Environment: Mac OS X 10.4.7
Reporter: Doug Cook
Priority: Minor

 Mergesegs leaves the output segment in URL-sorted order.
 This is a problem if the segment was just generated and not yet fetched - the 
 fetcher likes the URLs to be in essentially random order (sort by URL hash or 
 similar). If I fetch a segment created by mergesegs, my performance is 
 extremely poor since all URLs from a given host will be grouped together and 
 the per-host delays kill me.
 I have a local fix which I am using: map using a key of MD5(URL) + URL, then, 
 during the reduce phase, chop the MD5 off the front to get the original URL. 
 This is simple, has essentially random order, no problems with collisions, 
 and seems to work nicely.
 The only thing I don't know is whether or not there is some other tool 
 expecting the sorted order (I would expect not, since generate does not 
 produce this). Right now I have my fix as an option (-randomize), but if 
 there is no other tool requiring sorted order, it's probably cleaner to just 
 make this non-optional.
 Thoughts?
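
The key scheme described above (prefix the URL with its MD5 so that sorting scatters hosts, then strip the prefix again) can be sketched as follows. The class and method names are hypothetical, and the hex encoding of the digest is an assumption; the report does not say how the MD5 was encoded.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration of the MD5(URL) + URL key; not the reporter's code.
final class RandomizedUrlKey {
    // An MD5 digest is 16 bytes, i.e. 32 hex characters.
    private static final int MD5_HEX_LENGTH = 32;

    // Map side: build a key whose sort order is dominated by the hash.
    static String toKey(String url) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(url.getBytes(StandardCharsets.UTF_8));
            // Zero-pad so the prefix always has a fixed width.
            return String.format("%032x", new BigInteger(1, digest)) + url;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Reduce side: chop the fixed-width hash prefix off to recover the URL.
    static String toUrl(String key) {
        return key.substring(MD5_HEX_LENGTH);
    }

    public static void main(String[] args) {
        String key = toKey("http://example.com/a");
        System.out.println(key + " -> " + toUrl(key));
    }
}
```

Because the original URL is kept in the key, two URLs with the same hash still produce distinct keys, which matches the "no problems with collisions" remark.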
  

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Closed] (NUTCH-326) WordExtractor throws java.util.NoSuchElementException on some documents

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-326.
---

Resolution: Won't Fix

 WordExtractor throws java.util.NoSuchElementException on some documents
 ---

 Key: NUTCH-326
 URL: https://issues.apache.org/jira/browse/NUTCH-326
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 0.7.1, 0.7.2
Reporter: Tom Jensen
Priority: Minor

 At line 156 in org.apache.nutch.parse.msword.WordExtractor, it will on 
 occasion throw a java.util.NoSuchElementException because there is no 
 check as to whether or not the Iterator has been exhausted.  Suggest 
 adding this:
 if (!textIt.hasNext()) {
   break;
 }
 just before line 156.  Tested with problem Word documents: exceptions were 
 no longer thrown and text was extracted successfully.  Other documents that 
 previously had their text extracted successfully continued to 
 do so.
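
The guard suggested above matters whenever a loop body calls next() more than once per pass. The sketch below is a stand-in, not the WordExtractor source: the class IteratorGuardSketch and the method joinPairs are hypothetical, and only the variable name textIt is taken from the report.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a loop that consumes two elements per pass.
final class IteratorGuardSketch {
    static String joinPairs(List<String> pieces) {
        StringBuilder text = new StringBuilder();
        Iterator<String> textIt = pieces.iterator();
        while (textIt.hasNext()) {
            text.append(textIt.next());
            // Without this check, an odd number of pieces would make the
            // second next() throw java.util.NoSuchElementException.
            if (!textIt.hasNext()) {
                break;
            }
            text.append(textIt.next()).append('\n');
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.print(joinPairs(Arrays.asList("a", "b", "c")));
    }
}
```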

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Closed] (NUTCH-352) Add jar command to bin/nutch to allow launching hadoop job jars

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-352.
---

Resolution: Won't Fix

 Add jar command to bin/nutch to allow launching hadoop job jars
 ---

 Key: NUTCH-352
 URL: https://issues.apache.org/jira/browse/NUTCH-352
 Project: Nutch
  Issue Type: New Feature
Reporter: David Cathcart
Priority: Minor
 Attachments: nutch-jobjar.diff


 Add the ability to run hadoop job jars via bin/nutch jar jobjar.jar. See 
 attachment for patch.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Closed] (NUTCH-343) Index MP3 SHA1 hashes

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-343.
---

Resolution: Won't Fix

 Index MP3 SHA1 hashes
 -

 Key: NUTCH-343
 URL: https://issues.apache.org/jira/browse/NUTCH-343
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 0.8, 0.8.1, 0.9.0
Reporter: Hasan Diwan
 Attachments: parsemp3.pat


 Add indexing of each MP3's SHA-1 hash.
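
For reference, computing a SHA-1 hash of file content with the JDK might look like the sketch below. It is not the attached patch: the class name Sha1Sketch is hypothetical, and the choice to hash the raw bytes and store the result as lowercase hex is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration; the attached patch is not reproduced here.
final class Sha1Sketch {
    // Returns the SHA-1 digest of the given bytes as 40 lowercase hex characters.
    static String sha1Hex(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-1").digest(content);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to 0..255 so negative bytes still format as two digits.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sha1Hex("abc".getBytes(StandardCharsets.US_ASCII)));
    }
}
```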

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Closed] (NUTCH-26) New Http Authentication mechanism

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-26?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-26.
--

Resolution: Won't Fix

Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 New Http Authentication mechanism
 -

 Key: NUTCH-26
 URL: https://issues.apache.org/jira/browse/NUTCH-26
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Reporter: Stefan Groschupf
Priority: Trivial

 transferred from:
 http://sourceforge.net/tracker/index.php?func=detail&aid=990560&group_id=59548&atid=491356
 submitted by:
 Matt
 Here's a patch and lib (commons-codec used for Base64 
 encoding) which implements basic HTTP authentication. 
 I've attempted to build it so we can add more 
 authentication methods at a later time.
 This also includes the previously discussed 
 MultiProperties class which allows multiple headers with 
 the same name (as opposed to Properties which allows 
 only a single).
 I believe both John and Doug have had some comments 
 on this.
 Matt
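
As a reminder of what basic HTTP authentication puts on the wire, here is a minimal sketch that builds the Authorization header value. It is not the submitted patch: the class name BasicAuthSketch is hypothetical, and java.util.Base64 from the JDK is used in place of commons-codec, which the submission mentions.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical illustration; uses the JDK encoder rather than commons-codec.
final class BasicAuthSketch {
    // Builds the value of the "Authorization" request header for Basic auth:
    // the word "Basic", a space, then Base64 of "user:password".
    static String authorizationHeaderValue(String user, String password) {
        String credentials = user + ":" + password;
        byte[] raw = credentials.getBytes(StandardCharsets.UTF_8);
        return "Basic " + Base64.getEncoder().encodeToString(raw);
    }

    public static void main(String[] args) {
        System.out.println("Authorization: " + authorizationHeaderValue("Aladdin", "open sesame"));
    }
}
```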

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Closed] (NUTCH-259) Problem in IndexSorter after dedup

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-259.
---

Resolution: Won't Fix

Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 Problem in IndexSorter after dedup
 --

 Key: NUTCH-259
 URL: https://issues.apache.org/jira/browse/NUTCH-259
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Reporter: Michael
Priority: Minor

 When trying to run IndexSorter I'm getting an error:
 Exception in thread "main" java.lang.IllegalArgumentException: attempt to access a deleted document
     at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:282)
     at org.apache.lucene.index.FilterIndexReader.document(FilterIndexReader.java:104)
     at org.apache.nutch.indexer.IndexSorter$SortingReader.document(IndexSorter.java:170)
     at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:186)
     at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:88)
     at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:579)
     at org.apache.nutch.indexer.IndexSorter.sort(IndexSorter.java:240)
     at org.apache.nutch.indexer.IndexSorter.main(IndexSorter.java:291)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Closed] (NUTCH-283) If the Fetcher times out and abandons Fetcher Threads, severe errors will occur on those Threads

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-283.
---

Resolution: Won't Fix

Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 If the Fetcher times out and abandons Fetcher Threads, severe errors will 
 occur on those Threads
 

 Key: NUTCH-283
 URL: https://issues.apache.org/jira/browse/NUTCH-283
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.8
Reporter: Scott Ganyo
 Attachments: patch.txt, patch.txt


 If a Fetcher has chosen to time out and has abandoned outstanding Fetcher 
 Threads, resources that those Fetcher Threads may be using are closed.  This 
 naturally causes any abandoned Fetcher Threads to fail when they later 
 attempt to finish up their work in progress.
 I have a patch that addresses this that I am attaching.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Closed] (NUTCH-158) Process Sitemap data in text, rss or xml format as well as OAI-PMH

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-158.
---

Resolution: Won't Fix

Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 Process Sitemap data in text, rss or xml format as well as OAI-PMH
 --

 Key: NUTCH-158
 URL: https://issues.apache.org/jira/browse/NUTCH-158
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 0.8
Reporter: byron miller
Priority: Minor

 Add support to the fetcher to look for sitemap files, download them and 
 process them into webdb.
 Perhaps create a robots.txt directive that can be used to create a standard 
 format for sitemaps in RSS, XML or text format (one line per url) and process 
 that.
 I would love to see someone stomp on proprietary sitemap features or making 
 things so google specific as they are today :)
 * RSS format/Atom Format (standard)
 * XML meta description
 * OAI-PMH meta description 
 (http://www.openarchives.org/OAI/openarchivesprotocol.html)
 Perhaps even a pre crawler that will scour for these to inject into the web 
 db to help build your link map so you could even just index topN.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Closed] (NUTCH-251) Administration GUI

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-251.
---

Resolution: Won't Fix

Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 Administration GUI
 --

 Key: NUTCH-251
 URL: https://issues.apache.org/jira/browse/NUTCH-251
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.8
Reporter: Stefan Groschupf
Priority: Minor
 Attachments: Nutch-251-AdminGUI.tar.gz, hadoop_nutch_gui_v1.patch, 
 nutch_gui_plugins_v1.zip, nutch_gui_v1.patch


 Having a web based administration interface would help to make nutch 
 administration and management much more user friendly.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Closed] (NUTCH-164) Locale (language) choice by first session has global effect to all sessions

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-164.
---

Resolution: Won't Fix

Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 Locale (language) choice by first session has global effect to all sessions
 ---

 Key: NUTCH-164
 URL: https://issues.apache.org/jira/browse/NUTCH-164
 Project: Nutch
  Issue Type: Bug
  Components: web gui
Affects Versions: 0.7.1
 Environment: any
Reporter: KuroSaka TeruHiko

 Here's a report posted on nutch-users ML by Sergio [red...@redsun.homeip.net] 
 on 1/02/2006:
 
 I just installed nutch in a Fedora Core 3 server.
 Once installed, I crawled a small site to test it. I opened my navigator
 (mozilla 1.7 which reports by default ES-ES locales, and everything was ok).
 Then I asked a friend of mine  (the owner of the server) to test it. He did
 a search with an EN-US locale navigator, and the search page appeared in
 Spanish.
 After a few hours, I did the following: I restarted tomcat, I changed the
 locale of my mozilla to EN, and I opened the search page. Now I always get
 English search page even if I open with a mozilla ES-ES locale.
 I wrote a message to my friend:
 nutch keeps the locale of the first navigator that makes a request for all
 other requests. By this reason, yesterday as the first request was from my
 ES locale browser, you saw the page in Spanish with your browser that
 reports EN locale. There is a way to make this work:
 * Making sure that, after the server is restarted, the first request is done
 by a browser that reports EN locale.
 
 This happened in my environment too.  After taking a look at the code, I believe 
 this is caused by the use of the default message bundle in search.jsp.  The 
 code snippet looks like:
 <i18n:bundle baseName="org.nutch.jsp.search"/>
 ...
 <title>Nutch: <i18n:message key="title"/></title>
 ...
 The default message bundle probably has application scope.  Because of 
 that, the first setting of the language has a global effect on every session 
 created afterward.
 The right fix is to limit the scope to the session by inserting the scope 
 specifier, as in:
 <i18n:bundle scope="session" baseName="org.nutch.jsp.search"/>
 Other JSP files need to be inspected for the same issue and should be fixed 
 as well.
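
The difference between the two scopes can be modeled in a few lines of plain Java. This is a simplified model of the behavior described in the report, not the i18n taglib implementation: the class BundleScopeSketch and both method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical model of application-scoped versus session-scoped locale caching.
final class BundleScopeSketch {
    // Application scope: a single slot shared by every session.
    private Locale applicationLocale;
    // Session scope: one slot per session id.
    private final Map<String, Locale> sessionLocales = new HashMap<>();

    // The first caller fixes the locale for everyone who comes later.
    Locale resolveApplicationScoped(Locale requested) {
        if (applicationLocale == null) {
            applicationLocale = requested;
        }
        return applicationLocale;
    }

    // Each session keeps the locale of its own first request.
    Locale resolveSessionScoped(String sessionId, Locale requested) {
        return sessionLocales.computeIfAbsent(sessionId, id -> requested);
    }

    public static void main(String[] args) {
        BundleScopeSketch model = new BundleScopeSketch();
        Locale spanish = Locale.forLanguageTag("es-ES");
        model.resolveApplicationScoped(spanish);
        // A later English request still gets the first (Spanish) locale.
        System.out.println(model.resolveApplicationScoped(Locale.US));
    }
}
```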

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Closed] (NUTCH-162) country code jp is used instead of language code ja for Japanese

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-162.
---

Resolution: Won't Fix

Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 country code jp is used instead of language code ja for Japanese
 

 Key: NUTCH-162
 URL: https://issues.apache.org/jira/browse/NUTCH-162
 Project: Nutch
  Issue Type: Bug
  Components: web gui
Affects Versions: 0.7.1
 Environment: n/a
Reporter: KuroSaka TeruHiko
Priority: Trivial
 Attachments: anchors_ja.properties, cached_ja.properties, 
 explain_ja.properties, search_ja.properties, text_ja.properties


 In the locale switching link for Japanese, jp is used as the language code, but 
 it is an ISO country code.  The language code ja should be used.
 By the way, I don't think many users are familiar with the ISO language 
 codes.  A Canadian user may click on ca without knowing that ca stands for 
 Catalan, not Canadian English or French. Rather than listing the language 
 codes, listing the language names in the respective languages may be better. 
 (I say "may be" because the browser could show some language names as 
 corrupted text if the current font does not support that language; this is 
 a difficult problem.)
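
The distinction between language codes and country codes can be checked with java.util.Locale. The sketch below is only an illustration; the class name LanguageVsCountryCode and its two helper methods are hypothetical.

```java
import java.util.Locale;

// Hypothetical illustration of ISO language codes versus ISO country codes.
final class LanguageVsCountryCode {
    // Display name of an ISO 639 language code, rendered in the given locale.
    static String languageName(String languageCode, Locale in) {
        return Locale.forLanguageTag(languageCode).getDisplayLanguage(in);
    }

    // Display name of an ISO 3166 country code, rendered in the given locale.
    static String countryName(String countryCode, Locale in) {
        return new Locale.Builder().setRegion(countryCode).build().getDisplayCountry(in);
    }

    public static void main(String[] args) {
        System.out.println("ja as a language: " + languageName("ja", Locale.ENGLISH));
        System.out.println("JP as a country:  " + countryName("JP", Locale.ENGLISH));
        System.out.println("ca as a language: " + languageName("ca", Locale.ENGLISH));
        // Passing the language's own locale yields the name in that language.
        System.out.println("ja in Japanese:   " + languageName("ja", Locale.JAPANESE));
    }
}
```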

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Closed] (NUTCH-441) Thai Analyzer Plugin

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-441.
---

Resolution: Won't Fix

 Thai Analyzer Plugin
 

 Key: NUTCH-441
 URL: https://issues.apache.org/jira/browse/NUTCH-441
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Vee Satayamas
 Attachments: nutch-plugin-analysis-th-20070207.patch.gz


 This Thai analyzer plugin was created by copying and modifying the French 
 analyzer plugin. However, there is no Thai analyzer in 
 lucene-analyzers-2.0.0.jar (in lib-lucene-analyzers). Thus 
 lucene-analyzers-nightly.jar was used instead. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Closed] (NUTCH-224) Nutch doesn't handle Korean text at all

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-224.
---

Resolution: Won't Fix

 Nutch doesn't handle Korean text at all
 ---

 Key: NUTCH-224
 URL: https://issues.apache.org/jira/browse/NUTCH-224
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 0.7.1
Reporter: KuroSaka TeruHiko

 I was browsing NutchAnalysis.jj and found that
 Hangul Syllables (U+AC00 ... U+D7AF; U+XXXX means
 the Unicode character with the hex value XXXX) are not
 part of the LETTER or CJK class.  It seems to me that
 Nutch cannot handle Korean documents at all.
 I posted the above message to the nutch-user ML and Cheolgoo Kang 
 [app...@gmail.com]
 replied as follows:
 
 There was a similar issue with Lucene's StandardTokenizer.jj.
 http://issues.apache.org/jira/browse/LUCENE-444
 and
 http://issues.apache.org/jira/browse/LUCENE-461
 I have almost no experience with Nutch, but you can handle it like
 those issues above.
 
 Both fixes should probably be ported back to NutchAnalysis.jj.
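
Whether a code point falls in the Hangul Syllables block can be checked with the JDK's Unicode tables. This sketch says nothing about how NutchAnalysis.jj defines its character classes; the class name HangulCheck is hypothetical.

```java
// Hypothetical illustration using the JDK's Unicode block data.
final class HangulCheck {
    // True when the code point lies in the Hangul Syllables block (U+AC00 ... U+D7AF).
    static boolean isHangulSyllable(int codePoint) {
        return Character.UnicodeBlock.of(codePoint) == Character.UnicodeBlock.HANGUL_SYLLABLES;
    }

    public static void main(String[] args) {
        int first = 0xAC00; // first Hangul syllable
        System.out.println("Hangul syllable: " + isHangulSyllable(first));
        System.out.println("Letter:          " + Character.isLetter(first));
    }
}
```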

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Closed] (NUTCH-568) Indexer does not update the Lucene TITLE field

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-568.
---

Resolution: Won't Fix

 Indexer does not update the Lucene TITLE field
 

 Key: NUTCH-568
 URL: https://issues.apache.org/jira/browse/NUTCH-568
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.0.0
 Environment: Windows XP
Reporter: smorales
 Attachments: RN-071018-24.html


 Hi,
 The indexer is unable to update the TITLE field of the Lucene index when 
 processing specific HTML documents.
 This issue has been reproduced using Nutch nightly build #241 (Oct 19, 2007 
 4:01:28 AM).
 The problem does not occur with Nutch 0.9.
 Workflow:
 1.- Extract the package and copy across the following configuration files from 
 Nutch 0.9:
 - {nutch_home_9.0}/bin/url folder, containing the urls
 - {nutch_home_9.0}/conf/nutch-site.xml
 - {nutch_home_9.0}/conf/crawl-urlfilter.txt
 2.- To reproduce the issue, copy the attached HTML document to 
 your web server/filesystem.
 3.- Run the crawl.
 For example: ./nutch crawl urls -dir crawl -depth 22
 4.- Open the index using Luke.  For this test, I used lukeall-0.7.1.jar.
 5.- Select the Documents tab and move through the docs until you 
 find our HTML document.
 You will see that the TITLE field is empty, which is INCORRECT because this 
 HTML document contains a title.
 6.- Now, open the HTML document, add a space anywhere, then save it again.
 7.- Repeat steps 3 and 4.
 You will notice that this time the TITLE field contains the correct 
 information.
 Please advise,
 Many thanks in advance for your support.
 Sergio

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Closed] (NUTCH-249) black- white list url filtering

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-249.
---

Resolution: Won't Fix

Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 black- white list url filtering
 ---

 Key: NUTCH-249
 URL: https://issues.apache.org/jira/browse/NUTCH-249
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.8
Reporter: Stefan Groschupf
Assignee: Dennis Kubes
Priority: Trivial
 Attachments: blackWhiteListV2.patch, blackWhiteListV3.patch, bw.patch


 Existing URL filter mechanisms need to process each URL against each filter 
 pattern. For very large filter sets this may not scale very well.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Closed] (NUTCH-709) JSParseFilter gets into an infinate loop and ets all the stack

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-709.
---

Resolution: Won't Fix

Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 JSParseFilter gets into an infinate loop and ets all the stack 
 ---

 Key: NUTCH-709
 URL: https://issues.apache.org/jira/browse/NUTCH-709
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
 Environment: Hadoop 0.19.0 running nutch trunk 
Reporter: Tim Hawkins
 Attachments: JSParseFilter.error.patch


 When crawling pages with separate fetch and parse steps, I see processes die 
 because of a stack overflow. 
 The output is generally:
 java.lang.StackOverflowError
   at org.apache.nutch.parse.js.JSParseFilter.walk(JSParseFilter.java:146)
   at org.apache.nutch.parse.js.JSParseFilter.walk(JSParseFilter.java:148)
   at org.apache.nutch.parse.js.JSParseFilter.walk(JSParseFilter.java:148)
   at org.apache.nutch.parse.js.JSParseFilter.walk(JSParseFilter.java:148)
   at org.apache.nutch.parse.js.JSParseFilter.walk(JSParseFilter.java:148)
   at org.apache.nutch.parse.js.JSParseFilter.walk(JSParseFilter.java:148)
   at org.apache.nutch.parse.js.JSParseFilter.walk(JSParseFilter.java:148)
   at org.apache.nutch.parse.js.JSParseFilter.walk(JSParseFilter.java:148)
   at org.apache.nutch.parse.js.JSParseFilter.walk(JSParseFilter.java:148)
   at org.apache.nutch.parse.js.JSParseFilter.walk(JSParseFilter.java:148)
   at org.apache.nutch.parse.js.JSParseFilter.walk(JSParseFilter.java:148)
   at org.apache.nutch.parse.js.JSParseFilter.walk(JSParseFilter.java:148)
   at org.apache.nutch.parse.js.JSParseFilter.walk(JSParseFilter.java:148)
   at org.apache.nutch.parse.js.JSParseFilter.walk(JSParseFilter.java:148)
   at org.apache.nutch.parse.js.JSParseFilter.walk(JSParseFilter.java:148)
   at org.apache.nutch.parse.js.JSParseFilter.walk(JSParseFilter.java:148)
   at org.apache.nutch.parse.js.JSParseFilter.walk(JSParseFilter.java:148)
   at org.apache.nutch.parse.js.JSParseFilter.walk(JSParseFilter.java:148)
   at org.apache.nutch.parse.js.JSParseFilter.walk(JSParseFilter.java:148)
 Inspection of the code shows that this is a recursive call to walk(.) 
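
One common way to keep a deep tree walk from exhausting the call stack is to replace recursion with an explicit stack on the heap. The sketch below shows that general technique on a hypothetical Node type; it is not the JSParseFilter code and not the attached patch.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical illustration of an iterative depth-first walk.
final class IterativeWalkSketch {
    // Minimal tree node, standing in for a DOM node.
    static final class Node {
        final String name;
        final List<Node> children = new ArrayList<>();

        Node(String name) {
            this.name = name;
        }
    }

    // Visits every node depth-first without recursion and returns the node count.
    // Siblings are visited in reverse order because of the stack discipline.
    static int countNodes(Node root) {
        int count = 0;
        Deque<Node> pending = new ArrayDeque<>();
        pending.push(root);
        while (!pending.isEmpty()) {
            Node node = pending.pop();
            count++;
            for (Node child : node.children) {
                pending.push(child);
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Node root = new Node("html");
        root.children.add(new Node("head"));
        root.children.add(new Node("body"));
        System.out.println(countNodes(root));
    }
}
```

This only addresses depth. If the structure being walked could contain a cycle, a set of already visited nodes would be needed as well, since the loop above would otherwise never terminate.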

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Closed] (NUTCH-289) CrawlDatum should store IP address

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-289.
---

Resolution: Won't Fix

Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 CrawlDatum should store IP address
 --

 Key: NUTCH-289
 URL: https://issues.apache.org/jira/browse/NUTCH-289
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.8
Reporter: Doug Cutting
 Attachments: ipInCrawlDatumDraftV1.patch, 
 ipInCrawlDatumDraftV4.patch, ipInCrawlDatumDraftV5.1.patch, 
 ipInCrawlDatumDraftV5.patch


 If the CrawlDatum stored the IP address of the host of its URL, then one 
 could:
 - partition fetch lists on the basis of IP address, for better politeness;
 - truncate pages to fetch per IP address, rather than just hostname.  This 
 would be a good way to limit the impact of domain spammers.
 The IP addresses could be resolved when a CrawlDatum is first created for a 
 new outlink, or perhaps during CrawlDB update.
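
Partitioning by IP address rather than by host name could be sketched as below. This is an illustration only, not any of the attached draft patches: the class IpPartitionSketch and its methods are hypothetical, and hashing the textual address modulo the partition count is an assumed scheme.

```java
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical illustration of IP-based partitioning.
final class IpPartitionSketch {
    // Resolves a host name (or parses an IP literal) to its textual IP address.
    static String resolveIp(String host) {
        try {
            return InetAddress.getByName(host).getHostAddress();
        } catch (UnknownHostException e) {
            throw new UncheckedIOException(e);
        }
    }

    // All URLs whose hosts share an IP address land in the same partition.
    static int partitionFor(String ipAddress, int numPartitions) {
        return Math.floorMod(ipAddress.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        // An IP literal is parsed locally, so no DNS lookup happens here.
        String ip = resolveIp("127.0.0.1");
        System.out.println(ip + " -> partition " + partitionFor(ip, 8));
    }
}
```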

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Closed] (NUTCH-496) ConcurrentModificationException can be thrown when getSorted() is called.

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-496.
---

Resolution: Won't Fix

Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 ConcurrentModificationException can be thrown when getSorted() is called.
 -

 Key: NUTCH-496
 URL: https://issues.apache.org/jira/browse/NUTCH-496
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.9.0
 Environment: Nutch application, during fetch.
Reporter: Briggs
 Attachments: language_analyzer_ngram.patch, nutch-496.txt


 NGramProfile (in the org.apache.nutch.analysis.lang package) is not 
 thread-safe: a ConcurrentModificationException can occur if, while one thread 
 is iterating over the List returned by getSorted(), another thread calls 
 getSorted() again.
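
A common way to avoid this kind of failure is to have the accessor return a fresh sorted copy, so no caller ever iterates over a list that another call may re-sort. The sketch below shows that general pattern on a hypothetical class; it is not the NGramProfile source and not the attached patch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration of the defensive-copy pattern.
final class SortedSnapshotSketch {
    private final List<String> entries = new ArrayList<>();

    synchronized void add(String entry) {
        entries.add(entry);
    }

    // Returns a new sorted list on every call. The shared list is only read
    // while the lock is held, and callers iterate over their private copy.
    synchronized List<String> getSorted() {
        List<String> snapshot = new ArrayList<>(entries);
        Collections.sort(snapshot);
        return snapshot;
    }

    public static void main(String[] args) {
        SortedSnapshotSketch profile = new SortedSnapshotSketch();
        profile.add("c");
        profile.add("a");
        profile.add("b");
        for (String entry : profile.getSorted()) {
            // A second call during iteration does not disturb the first copy.
            profile.getSorted();
            System.out.println(entry);
        }
    }
}
```

The copy costs one allocation per call, which trades some memory for not having to coordinate readers.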

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Closed] (NUTCH-424) NekoHTML's DOMFragmentParser hangs on certain URLs (CLONE: Problem persists with Nutch 0.9 and 0.8.1 (Nekohtml 0.9.4))

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-424.
---

Resolution: Won't Fix

Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 NekoHTML's DOMFragmentParser hangs on certain URLs (CLONE: Problem persists 
 with Nutch 0.9 and 0.8.1 (Nekohtml 0.9.4))
 --

 Key: NUTCH-424
 URL: https://issues.apache.org/jira/browse/NUTCH-424
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.8.1, 0.9.0
 Environment: Linux and Windows
Reporter: Karsten Dello

 I've tracked down occasional fetcher hangs to NekoHTML's DOMFragmentParser 
 hanging on certain HTML documents, for example, 
 http://www.inlandrevenue.gov.uk/charities/chapter_3.htm.
 The thread dump on the hung parser is:
 "CompilerThread0" daemon prio=1 tid=0x080c4c18 nid=0x47da waiting on condition [0x..0x8a3daf68]
 "Signal Dispatcher" daemon prio=1 tid=0x080c3d60 nid=0x47d9 waiting on condition [0x..0x]
 "Finalizer" daemon prio=1 tid=0x080b8818 nid=0x47d8 in Object.wait() [0x8a2a..0x8a2a0680]
     at java.lang.Object.wait(Native Method)
     - waiting on <0x4a60d058> (a java.lang.ref.ReferenceQueue$Lock)
     at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
     - locked <0x4a60d058> (a java.lang.ref.ReferenceQueue$Lock)
     at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
     at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
 "Reference Handler" daemon prio=1 tid=0x080b7b50 nid=0x47d7 in Object.wait() [0x8a21f000..0x8a21f800]
     at java.lang.Object.wait(Native Method)
     - waiting on <0x4a60d0d8> (a java.lang.ref.Reference$Lock)
     at java.lang.Object.wait(Object.java:474)
     at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
     - locked <0x4a60d0d8> (a java.lang.ref.Reference$Lock)
 "main" prio=1 tid=0x0805c170 nid=0x47d1 waiting on condition [0xbfffc000..0xbfffcec8]
     at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:99)
     at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:393)
     at java.lang.StringBuffer.append(StringBuffer.java:225)
     - locked <0x45910118> (a java.lang.StringBuffer)
     at org.apache.xerces.dom.CharacterDataImpl.appendData(Unknown Source)
     at org.cyberneko.html.parsers.DOMFragmentParser.characters(Unknown Source)
     at org.cyberneko.html.filters.DefaultFilter.characters(Unknown Source)
     at org.cyberneko.html.HTMLTagBalancer.characters(Unknown Source)
     at org.cyberneko.html.HTMLScanner$ContentScanner.scanCharacters(Unknown Source)
     at org.cyberneko.html.HTMLScanner$ContentScanner.scan(Unknown Source)
     at org.cyberneko.html.HTMLScanner.scanDocument(Unknown Source)
     at org.cyberneko.html.HTMLConfiguration.parse(Unknown Source)
     at org.cyberneko.html.HTMLConfiguration.parse(Unknown Source)
     at org.cyberneko.html.parsers.DOMFragmentParser.parse(Unknown Source)
     at net.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:157)
     at net.nutch.parse.ParserChecker.main(ParserChecker.java:74)
 "VM Thread" prio=1 tid=0x080b4f30 nid=0x47d6 runnable
 "VM Periodic Task Thread" prio=1 tid=0x080c75f8 nid=0x47dc waiting on condition
 Using the URL mentioned above, I was able to successfully parse the file 
 using a normal NekoHTML DocumentParser.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Closed] (NUTCH-119) Regexp to extract outlinks incorrect

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-119.
---

Resolution: Won't Fix

Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 Regexp to extract outlinks incorrect
 

 Key: NUTCH-119
 URL: https://issues.apache.org/jira/browse/NUTCH-119
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.7.1, 0.7.2, 0.8
Reporter: Sébastien Le Callonnec
 Attachments: TestPattern.java, TestPattern.java


 The regexp which extracts outlinks is incorrect.  It extracts in-line CSS 
 styles, and leaves out links such as 
 <a href="/sitemap.html">browse</a>.  This has been reported by Earl Cahill on 
 the user list.



[jira] [Closed] (NUTCH-414) parse-mp3 plugin concatenating previous tags for text field

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-414.
---

Resolution: Won't Fix

Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 parse-mp3 plugin concatenating previous tags for text field
 ---

 Key: NUTCH-414
 URL: https://issues.apache.org/jira/browse/NUTCH-414
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.9.0
 Environment: -
Reporter: Brian Whitman

 The parse-mp3 plugin seems to be saving a state of the previous parse's text 
 content. For every new mp3 file parsed, it is putting the contents of all the 
 previous text fields in the plain text field for that file.
 You can see this by fetching a set of mp3s in one segment, then viewing their 
 plain text in the nutch webapp. The plaintext will include the contents of 
 all files fetched in that round, which makes searching fruitless.
 I made a tiny band-aid change to MP3Parser.java and MetadataCollector.java 
 against the nightly. It seems to fix the problem.
 --- MP3Parser.java  2006-12-10 09:43:26.0 -0500
 +++ MP3Parser.java.new  2006-12-10 16:37:03.0 -0500
 @@ -67,7 +67,7 @@
fos.write(raw);
fos.close();
MP3File mp3 = new MP3File(tmp);
 -
 + metadataCollector.clearText();
if (mp3.hasID3v2Tag()) {
  parse = getID3v2Parse(mp3, content.getMetadata());
} else if (mp3.hasID3v1Tag()) {
 --- MetadataCollector.java  2006-12-10 09:43:26.0 -0500
 +++ MetadataCollector.java.new  2006-12-10 16:37:28.0 -0500
 @@ -42,6 +42,10 @@
this.conf = conf;
}
 +  public void clearText() {
 +   text = "";
 +  }
 +
public void notifyProperty(String name, String value) throws
 MalformedURLException {
  if (name.equals("TIT2-Text"))
setTitle(value);



[jira] [Closed] (NUTCH-113) Disable permanent DNS-to-IP caching for JVM 1.4

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-113.
---

Resolution: Won't Fix

Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 Disable permanent DNS-to-IP caching for JVM 1.4
 ---

 Key: NUTCH-113
 URL: https://issues.apache.org/jira/browse/NUTCH-113
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.7.2, 0.8
Reporter: Fuad Efendi
Priority: Trivial

 DNS-to-IP mapping may change during long crawls, but by default JVM 1.4 
 caches it forever.
 Some related discussions at Jakarta-HttpClient-User
 http://mail-archives.apache.org/mod_mbox/jakarta-httpclient-user/200506.mbox/%3c20050627022440.SVIL13442.lakermmtao05.cox.net@zeus%3e
 http://java.sun.com/j2se/1.4.2/docs/guide/net/properties.html
networkaddress.cache.ttl (default: -1) 
Specified in java.security to indicate the caching policy for successful 
 name lookups from the name service. The value is specified as an integer to 
 indicate the number of seconds to cache the successful lookup. 
A value of -1 indicates cache forever. 
 We probably need this code in org.apache.nutch.fetcher.Fetcher:
   private static final int FETCHER_DNS_TTL_MINUTES =
 NutchConf.get().getInt("fetcher.dns.ttl.minutes", 120);
   static {
 java.security.Security.setProperty("networkaddress.cache.ttl", "" + 
 FETCHER_DNS_TTL_MINUTES*60);
   }
 And, new property in nutch-default.xml:
 <property>
   <name>fetcher.dns.ttl.minutes</name>
   <value>120</value>
   <description>DNS-to-IP cache, Time-to-Live</description>
 </property>
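
 The static initializer proposed above can be sketched in plain Java without 
 the NutchConf lookup. DnsCacheTtl, ttlSeconds and apply are hypothetical names 
 used only for illustration; the only real API used is 
 java.security.Security.setProperty. As far as I know the JVM typically reads 
 this policy once, so the call should run before the first name lookup.

```java
import java.security.Security;

// Hypothetical helper, not part of Nutch.
final class DnsCacheTtl {

    /** Converts a TTL in minutes to the seconds string the JVM property expects. */
    static String ttlSeconds(int minutes) {
        return Integer.toString(minutes * 60);
    }

    /** Sets the JVM-wide TTL for successful DNS lookups. */
    static void apply(int minutes) {
        Security.setProperty("networkaddress.cache.ttl", ttlSeconds(minutes));
    }
}
```

 Calling DnsCacheTtl.apply(120) corresponds to the proposed default of 120 
 minutes.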



[jira] [Closed] (NUTCH-87) Efficient site-specific crawling for a large number of sites

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-87?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-87.
--

Resolution: Won't Fix

Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 Efficient site-specific crawling for a large number of sites
 

 Key: NUTCH-87
 URL: https://issues.apache.org/jira/browse/NUTCH-87
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 0.7.2, 0.8
 Environment: cross-platform
Reporter: AJ Chen
 Attachments: JIRA-87-whitelistfilter.tar.gz, build.xml.patch, 
 build.xml.patch-0.8, urlfilter-whitelist.tar.gz


 There is a gap between whole-web crawling and crawling a single site (or a 
 handful of sites). Many applications actually fall in this gap: they usually 
 need to crawl a large number of selected sites, say 10 domains. The current 
 CrawlTool is designed for a handful of sites. So, this request calls for a 
 new feature or improvement on CrawlTool so that the nutch crawl command can 
 efficiently deal with a large number of sites. One requirement is to add or 
 change the smallest amount of code so that this feature can be implemented 
 sooner rather than later. 
 There is a discussion about adding a URLFilter to implement this requested 
 feature, see the following thread - 
 http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00726.html
 The idea is to use a hashtable in the URLFilter to look up the regex for any 
 given domain. A hashtable will be much faster than the list implementation 
 currently used in RegexURLFilter.  Fortunately, Matt Kangas has implemented 
 this idea before for his own application and is willing to make it available 
 for adaptation to Nutch. I'll be happy to help him in this regard.  
 But, before we do it, we would like to hear more discussion or comments 
 about this approach or other approaches. In particular, let us know what 
 the potential downsides of a hashtable lookup in a new URLFilter plugin are.
 AJ Chen
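
 The hashtable idea described above can be sketched as follows. This is not 
 Matt Kangas's implementation and not the Nutch URLFilter plugin API; 
 DomainWhitelistFilter, allow and filter are hypothetical names, and returning 
 null for a rejected URL is an assumption modelled on how URL filters are 
 commonly written.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch: one regex per host, found by hash lookup instead of a list scan.
final class DomainWhitelistFilter {
    private final Map<String, Pattern> rulesByHost = new HashMap<>();

    /** Registers the regex that URLs on the given host must match. */
    void allow(String host, String regex) {
        rulesByHost.put(host.toLowerCase(Locale.ROOT), Pattern.compile(regex));
    }

    /** Returns the URL if its host is known and its regex matches, otherwise null. */
    String filter(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return null; // unparsable URL
        }
        if (host == null) {
            return null;
        }
        Pattern rule = rulesByHost.get(host.toLowerCase(Locale.ROOT));
        return (rule != null && rule.matcher(url).find()) ? url : null;
    }
}
```

 The cost per URL is one hash lookup plus one regex match, independent of how 
 many domains are registered. One downside of this sketch is that it matches 
 hosts exactly, so subdomains need their own entries.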



[jira] [Closed] (NUTCH-460) RDF parser plugin

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-460.
---

Resolution: Won't Fix

Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 RDF parser plugin
 -

 Key: NUTCH-460
 URL: https://issues.apache.org/jira/browse/NUTCH-460
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Ricardo J. Méndez
 Attachments: rubyspider-rdf.zip


 I've written a couple plugins that I'd like to contribute.  
 RDFLinkParseFilter looks for links on the pages that point towards RDF 
 information, and tags the pages with metadata about the type of links they 
 hold. RDFLinkIndexingFilter indexes said metadata.  RDFParser parses RDF 
 information from several possible formats using Jena, and extracts the links 
 that the file points to as Outlinks so that they can be fetched as well.



[jira] [Closed] (NUTCH-182) Log when db.max configuration limits reached

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-182.
---

Resolution: Won't Fix

Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 Log when db.max configuration limits reached
 

 Key: NUTCH-182
 URL: https://issues.apache.org/jira/browse/NUTCH-182
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.8
Reporter: Matt Kangas
Priority: Trivial
 Attachments: LinkDb.java.patch, ParseData.java.patch


 Followup to http://www.nabble.com/Re%3A-Can%27t-index-some-pages-p2480833.html
 There are three db.max parameters currently in nutch-default.xml:
  * db.max.outlinks.per.page
  * db.max.anchor.length
  * db.max.inlinks
 Having values that are too low can result in a site being under-crawled. 
 However, currently there is nothing written to the log when these limits are 
 hit, so users have to guess when they need to raise these values.
 I suggest that we add three new log messages at the appropriate points:
  * Exceeded db.max.outlinks.per.page for URL 
  * Exceeded db.max.anchor.length for URL 
  * Exceeded db.max.inlinks for URL 
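
 A minimal sketch of the first suggested message, assuming the limit is 
 applied by truncating a list of outlinks. OutlinkLimiter and limit are 
 hypothetical names and java.util.logging stands in for whatever logger the 
 real code uses; this is not the attached patch.

```java
import java.util.List;
import java.util.logging.Logger;

// Hypothetical helper: applies the limit and logs when it is hit.
final class OutlinkLimiter {
    private static final Logger LOG = Logger.getLogger(OutlinkLimiter.class.getName());

    /** Returns at most maxOutlinks entries and logs a warning if any were dropped. */
    static List<String> limit(String url, List<String> outlinks, int maxOutlinks) {
        if (outlinks.size() <= maxOutlinks) {
            return outlinks;
        }
        LOG.warning("Exceeded db.max.outlinks.per.page for URL " + url
                + " (" + outlinks.size() + " found, keeping " + maxOutlinks + ")");
        return outlinks.subList(0, maxOutlinks);
    }
}
```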



[jira] [Closed] (NUTCH-826) Mailing list is broken.

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-826.
---


Bulk close of resolved issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 Mailing list is broken.
 ---

 Key: NUTCH-826
 URL: https://issues.apache.org/jira/browse/NUTCH-826
 Project: Nutch
  Issue Type: Bug
Reporter: John Sherwood
Assignee: Julien Nioche
Priority: Blocker
 Fix For: 1.1


 All of the following addresses are failing:
 nutch-u...@nutch.apache.org
 nutch-user-subscr...@nutch.apache.org
 nutch-user-subscr...@lucene.apache.org
 For the last one, the mailer daemon said 
 "This mailing list has moved to user at nutch.apache.org."
 Below is the message I tried to send:
 Hi people,
 I've been banging my head against this problem for two days now.
 Simply, I want to add a field with the value of a given meta tag.
 I've been trying the parse-xml plugin, but it seems that it doesn't
 work with version 1.0.  I've tried the code at
 http://sujitpal.blogspot.com/2009/07/nutch-getting-my-feet-wet.html
 and it hasn't worked.  I don't even know why.  I don't even know if my
 plugin is being used... or even looked for!  Nutch seems to have an
 infuriating "Fail silently" policy for plugins.  I put a
 System.exit(1) in my filters just to see if my code is even being
 encountered.  It has not, in spite of my config telling it to.
 Here's my config:
 nutch-site.xml
 ...
 <property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|metadata</value>
 </property>
 ...
 parse-plugins.xml
 ...
 <mimeType name="application/xhtml+xml">
   <plugin id="parse-html" />
   <plugin id="metadata" />
 </mimeType>
 <mimeType name="text/html">
   <plugin id="parse-html" />
   <plugin id="metadata" />
 </mimeType>
 <mimeType name="text/sgml">
   <plugin id="parse-html" />
   <plugin id="metadata" />
 </mimeType>
 <mimeType name="text/xml">
   <plugin id="parse-html" />
   <plugin id="parse-rss" />
   <plugin id="metadata" />
   <plugin id="feed" />
 </mimeType>
 ...
 <alias name="metadata"
        extension-id="com.example.website.nutch.parsing.MetaTagExtractorParseFilter" />
 ...
 I've also copied the plugin.xml and jar from my build/metadata to the
 plugins root dir.
 Nonetheless, Nutch runs and puts data in solr for me.  Afaik, Nutch is
 completely unaware of my plugin despite my config options.  Is there
 some other place I need to tell Nutch to use my plugin?  Is there some
 other approach to do this without having to write a plugin?  This does
 seem like a lot of work to simply get a meta tag into a field.  Any
 help would be appreciated.
 Sincerely,
 John Sherwood



[jira] [Closed] (NUTCH-570) Improvement of URL Ordering in Generator.java

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-570.
---


Bulk close of resolved issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 Improvement of URL Ordering in Generator.java
 -

 Key: NUTCH-570
 URL: https://issues.apache.org/jira/browse/NUTCH-570
 Project: Nutch
  Issue Type: Improvement
  Components: generator
Reporter: Ned Rockson
Assignee: Otis Gospodnetic
Priority: Minor
 Attachments: GeneratorDiff.out, GeneratorDiff_v1.out


 [Copied directly from my email to nutch-dev list]
 Recently I switched to Fetcher2 over Fetcher for larger whole web fetches 
 (50-100M at a time).  I found that the URLs generated are not optimal because 
 they are simply randomized by a hash comparator.  In one crawl on 24 machines 
 it took about 3 days to crawl 30M URLs.  Compared with old benchmarks I 
 had set with the regular Fetcher.java, this was at least three times as long.
 Anyway, I realized that the best situation for ordering can be approached by 
 randomization, but in order to get optimal ordering, urls from the same host 
 should be as far apart in the list as possible.  So I wrote a series of 2 
 map/reduces to optimize the ordering and for a list of 25M documents it takes 
 about 10 minutes on our cluster.  Right now I have it in its own class, but I 
 figured it can go in Generator.java and just add a flag in nutch-default.xml 
 determining if the user wants to use it.
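
 To illustrate the ordering goal above (URLs from the same host far apart), 
 here is a single-machine sketch that interleaves URLs round-robin by host. It 
 is not the pair of map/reduce jobs described in the email, and 
 HostInterleaver is a hypothetical name. Round-robin only approximates the 
 optimal spacing when a few hosts hold most of the URLs.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: spread same-host URLs apart by taking one URL per host per round.
final class HostInterleaver {

    static List<String> interleave(List<String> urls) {
        // Group URLs by host, keeping first-seen host order.
        Map<String, Deque<String>> byHost = new LinkedHashMap<>();
        for (String url : urls) {
            String host = URI.create(url).getHost();
            byHost.computeIfAbsent(host == null ? "" : host, h -> new ArrayDeque<>()).add(url);
        }
        // Emit one URL per host per round until every queue is empty.
        List<String> ordered = new ArrayList<>(urls.size());
        while (!byHost.isEmpty()) {
            Iterator<Deque<String>> queues = byHost.values().iterator();
            while (queues.hasNext()) {
                Deque<String> queue = queues.next();
                ordered.add(queue.poll());
                if (queue.isEmpty()) {
                    queues.remove();
                }
            }
        }
        return ordered;
    }
}
```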



[jira] [Closed] (NUTCH-742) Checksum Error

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-742.
---


Bulk close of resolved issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 Checksum Error 
 ---

 Key: NUTCH-742
 URL: https://issues.apache.org/jira/browse/NUTCH-742
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.0.0
 Environment: linux ubuntu8.0.4 64bit 
 10datanode 4G of memory per node 
Reporter: mawanqiang

 Nutch 1.0 fails with an error when creating an index from approximately 1 million records.
 The error is:
 java.lang.RuntimeException: problem advancing post rec#6758513
 at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:883)
 at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:237)
 at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:233)
 at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:79)
 at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50)
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
 at org.apache.hadoop.mapred.Child.main(Child.java:158)
 Caused by: org.apache.hadoop.fs.ChecksumException: Checksum Error
 at org.apache.hadoop.mapred.IFileInputStream.doRead(IFileInputStream.java:153)
 at org.apache.hadoop.mapred.IFileInputStream.read(IFileInputStream.java:90)
 at org.apache.hadoop.mapred.IFile$Reader.readData(IFile.java:301)
 at org.apache.hadoop.mapred.IFile$Reader.rejigData(IFile.java:331)
 at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:315)
 at org.apache.hadoop.mapred.IFile$Reader.next(IFile.java:377)
 at org.apache.hadoop.mapred.Merger$Segment.next(Merger.java:174)
 at org.apache.hadoop.mapred.Merger$MergeQueue.adjustPriorityQueue(Merger.java:277)
 at org.apache.hadoop.mapred.Merger$MergeQueue.next(Merger.java:297)
 at org.apache.hadoop.mapred.Task$ValuesIterator.readNextKey(Task.java:922)
 at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:881)
 ... 6 more



[jira] [Closed] (NUTCH-44) too many search results

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-44?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-44.
--


Bulk close of resolved issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 too many search results
 ---

 Key: NUTCH-44
 URL: https://issues.apache.org/jira/browse/NUTCH-44
 Project: Nutch
  Issue Type: Bug
  Components: web gui
 Environment: web environment
Reporter: Emilijan Mirceski
Assignee: Dennis Kubes
 Attachments: NUTCH-44-2-20080215.patch, NUTCH-44.patch


 There should be a (user-defined) limit on the number of results the 
 search engine can return. 
 For example, if one modifies the search URL as:
 http://my/search.jsp?query=some query&hitsPerPage=20000&hitsPerSite=0
 The search will try to return 20,000 pages, which isn't good for server-side 
 performance. 
 Is it possible to have a setting in the config XML files to control this?
 Thanks,
 Emilijan
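
 The requested limit amounts to clamping the hitsPerPage request parameter 
 against a configured maximum. A minimal sketch, assuming the maximum has 
 already been read from configuration; HitsLimiter and clamp are hypothetical 
 names, not the attached patch:

```java
// Hypothetical helper: bounds the user-supplied hitsPerPage value.
final class HitsLimiter {

    /** Returns a value between 1 and maxHitsPerPage, inclusive. */
    static int clamp(int requested, int maxHitsPerPage) {
        if (requested < 1) {
            return 1;
        }
        return Math.min(requested, maxHitsPerPage);
    }
}
```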



[jira] [Closed] (NUTCH-854) Define standard attributes with values and explaination to configuration files in conf directory

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-854.
---


Bulk close of resolved issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 Define standard attributes with values and explaination to configuration 
 files in conf directory
 

 Key: NUTCH-854
 URL: https://issues.apache.org/jira/browse/NUTCH-854
 Project: Nutch
  Issue Type: Improvement
 Environment: Window XP SP3, Cygwin, JDK 1.6.20, Ant 1.8.1
Reporter: Pham Tuan Minh
 Fix For: 2.0


 It would make Nutch easier to use if every configuration file in the conf 
 directory defined standard attributes with values and explanations. For 
 example, nutch-site.xml.template currently contains no attributes and no 
 explanation; we should define them.
 -
 <?xml version="1.0"?>
 <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
 <!-- site-specific property overrides in this file. -->
 <configuration>
 <!-- Agent name-->
 <property>
 <name>http.agent.name</name>
 <value>nutch-solr-integration</value>
 </property>
 <!-- -->
 <property>
 <name>generate.max.per.host</name>
 <value>100</value>
 </property>
 <property>
 <!-- plug-in using in this site -->
 <name>plugin.includes</name>
 <value>protocol-http|urlfilter-regex|parse-tika|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
 </property>
 </configuration>
 -
 Thanks,



[jira] [Closed] (NUTCH-958) Httpclient scheme priority order fix

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-958.
---


Bulk close of resolved issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 Httpclient scheme priority order fix
 

 Key: NUTCH-958
 URL: https://issues.apache.org/jira/browse/NUTCH-958
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.3
Reporter: Claudio Martella
 Fix For: 1.3

 Attachments: httpclient.diff


 Httpclient will try to authenticate in this order by default: ntlm, digest, 
 basic.
 If you set as default a scheme that comes in this list after a scheme that is 
 negotiated by the server, and this authentication fails, the default scheme 
 will not be tried.
 I.e. if you set digest as default scheme but the server negotiates ntlm, the 
 client will still try ntlm and fail.
 The fix sets the default scheme as the only possible scheme for 
 authentication for the given realm by setting the authentication priorities 
 of httpclient.
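
 The behaviour described above can be modelled in a few lines: the client 
 walks its priority list and takes the first scheme the server offers. This 
 is a library-free sketch with hypothetical names (AuthSchemeChooser, choose); 
 it does not call the httpclient API and is not the attached diff.

```java
import java.util.Collection;
import java.util.List;

// Hypothetical model of priority-based scheme selection.
final class AuthSchemeChooser {

    /** Returns the first scheme in priority order that the server offers, or null. */
    static String choose(List<String> priority, Collection<String> offeredByServer) {
        for (String scheme : priority) {
            if (offeredByServer.contains(scheme)) {
                return scheme;
            }
        }
        return null;
    }
}
```

 With the default priority (ntlm, digest, basic) and a server offering ntlm 
 and digest, the model picks ntlm. Restricting the priority list to digest 
 alone makes it pick digest, which is the effect the fix aims for.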



[jira] [Closed] (NUTCH-866) STOP Nutch without breaking the crawled data

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-866.
---


Bulk close of resolved issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 STOP Nutch without breaking the crawled data
 

 Key: NUTCH-866
 URL: https://issues.apache.org/jira/browse/NUTCH-866
 Project: Nutch
  Issue Type: New Feature
Reporter: Pham Tuan Minh
 Fix For: 2.0


 How can we stop a running Nutch instance, in local mode and in reducer mode, 
 without breaking the crawled data? 
 For example, you push a list of sites that takes a long time to crawl 
 completely, and then you want to stop the Nutch instance suddenly ...
 - For local mode, I suggest the following:
 We create a stop.txt file in a specific directory; at regular intervals the 
 Nutch instance checks whether this file exists and, if it does, stops itself 
 normally.
 - For reducer mode, could we use ZooKeeper to keep the state of each instance?
 Any other suggestions?
 Thanks,
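
 The stop-file suggestion for local mode can be sketched as below. 
 StopFileWatcher and its methods are hypothetical names; the sketch checks 
 for the file between units of work so that each unit finishes cleanly, and 
 it says nothing about how Nutch itself would split a crawl into units.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical sketch of a cooperative stop signalled by a file such as stop.txt.
final class StopFileWatcher {
    private final Path stopFile;

    StopFileWatcher(Path stopFile) {
        this.stopFile = stopFile;
    }

    /** True once the stop file exists. */
    boolean stopRequested() {
        return Files.exists(stopFile);
    }

    /** Creates the stop file; does nothing if it already exists. */
    void requestStop() {
        try {
            Files.createFile(stopFile);
        } catch (FileAlreadyExistsException e) {
            // Stop was already requested.
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Runs units in order, checking for the stop file before each; returns how many ran. */
    int runUntilStopped(List<Runnable> units) {
        int completed = 0;
        for (Runnable unit : units) {
            if (stopRequested()) {
                break;
            }
            unit.run();
            completed++;
        }
        return completed;
    }
}
```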



[jira] [Closed] (NUTCH-86) LanguageIdentifier API enhancements

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-86?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-86.
--


Bulk close of resolved issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 LanguageIdentifier API enhancements
 ---

 Key: NUTCH-86
 URL: https://issues.apache.org/jira/browse/NUTCH-86
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 0.6, 0.7, 0.8
Reporter: Jerome Charron
Assignee: Chris A. Mattmann
Priority: Minor
 Fix For: 1.2, 2.0


 More information can be found in the following thread on the Nutch-Dev 
 mailing list:
 http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00569.html
 Summary:
 1. LanguageIdentifier API changes. The similarity methods should return an 
 ordered array of language-code/score pairs instead of a simple String 
 containing the language-code.
 2. Ensure consistency between LanguageIdentifier scoring and 
 NGramProfile.getSimilarity().
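
 Point 1 could look roughly like the sketch below: instead of a single 
 language code, the caller gets every candidate with its score, best first. 
 LanguageScore and LanguageRanking are hypothetical types, not the actual 
 LanguageIdentifier API, and the scores are assumed to be "higher is better".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical pair of language code and similarity score.
final class LanguageScore {
    final String code;
    final double score;

    LanguageScore(String code, double score) {
        this.code = code;
        this.score = score;
    }
}

// Hypothetical helper that orders candidates by descending score.
final class LanguageRanking {

    static List<LanguageScore> rank(Map<String, Double> scoreByLanguage) {
        List<LanguageScore> ranked = new ArrayList<>();
        for (Map.Entry<String, Double> entry : scoreByLanguage.entrySet()) {
            ranked.add(new LanguageScore(entry.getKey(), entry.getValue()));
        }
        ranked.sort((a, b) -> Double.compare(b.score, a.score));
        return ranked;
    }
}
```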



[jira] [Closed] (NUTCH-591) StringIndexOutOfBoundsException when extracting text from a Word document.

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-591.
---


Bulk close of resolved issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 StringIndexOutOfBoundsException when extracting text from a Word document.
 --

 Key: NUTCH-591
 URL: https://issues.apache.org/jira/browse/NUTCH-591
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.9.0
 Environment: linux
 redhat as4u4 x86
 kernel 2.6.9
Reporter: frank ling

 see 
 http://issues.apache.org/bugzilla/show_bug.cgi?id=41076+



[jira] [Closed] (NUTCH-363) Fetcher normalizes everything at least twice

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-363.
---


Bulk close of resolved issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 Fetcher normalizes everything at least twice
 

 Key: NUTCH-363
 URL: https://issues.apache.org/jira/browse/NUTCH-363
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.8
 Environment: OS X 10.4.7
Reporter: Doug Cook
Priority: Minor
 Fix For: 2.0


 New links are normalized twice by the fetcher: 
 First in DOMContentUtils.getOutlinks, where the constructor 
 Outlink(url.toString(), linkText.toString().trim(), conf)  normalizes the URL.
 The second time is in ParseOutputFormat.write().
 For some URLs (e.g. those repeated on a page) a given URL may be normalized a 
 number of times, but it is always normalized at least twice.
 For those of us with expensive normalizations, this is probably burning some 
 CPU. 
 I'd gladly fix this, but I'm not yet familiar enough with the code to know if 
 there are some hidden assumptions which rely on this behavior.
 [A related note is that URLs are normalized *before* filtering; this is 
 causing a lot of extra normalization as well. In general, filters may not be 
 safe to run before normalization, but there is likely a class of them which 
 are (filtering out .gif/.jpg etc). Perhaps the notion of a pre-normalizer 
 filter would be a useful one?]
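
 The pre-normalizer filter idea in the bracketed note could be as simple as a 
 suffix check that runs before any expensive normalization. This sketch 
 assumes that rejecting by file extension is safe on un-normalized URLs; 
 PreNormalizerFilter and worthNormalizing are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical cheap filter applied before normalization.
final class PreNormalizerFilter {
    private static final Set<String> SKIPPED_SUFFIXES =
            new HashSet<>(Arrays.asList(".gif", ".jpg"));

    /** Returns false for URLs whose path ends in a skipped suffix, ignoring any query string. */
    static boolean worthNormalizing(String url) {
        String lower = url.toLowerCase(Locale.ROOT);
        int query = lower.indexOf('?');
        if (query >= 0) {
            lower = lower.substring(0, query);
        }
        int dot = lower.lastIndexOf('.');
        return dot < 0 || !SKIPPED_SUFFIXES.contains(lower.substring(dot));
    }
}
```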



[jira] [Closed] (NUTCH-185) XMLParser is configurable xml parser plugin.

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-185.
---


Bulk close of resolved issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 XMLParser is configurable xml parser plugin.
 

 Key: NUTCH-185
 URL: https://issues.apache.org/jira/browse/NUTCH-185
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher, indexer
Affects Versions: 0.7.2, 0.8, 0.8.1
 Environment: OS Independent
Reporter: Rida Benjelloun
Assignee: Chris A. Mattmann
 Fix For: 1.1

 Attachments: parse-xml.patch, parse-xml.zip, parse-xml.zip


 XMLParser is a configurable plugin. It uses XPath and namespaces to map 
 XML elements to Lucene fields. 
 Information:
 1- Copy xmlparser-conf.xml to the nutch/conf dir
 2- To index your custom XML file, you have to modify 
 xmlparser-conf.xml. 
 This parser uses namespaces and XPath to parse XML content.
 The config file maps the XML nodes (using XPath) to Lucene 
 fields. 
 Example : <field name="dctitle" xpath="//dc:title" type="Text" boost="1.4" /> 
 3- The xmlIndexerProperties element encapsulates a set of fields associated 
 with a namespace. 
 If the namespace is found in the XML document, the fields represented by the 
 namespace will be indexed.
 Example : 
 <xmlIndexerProperties type="filePerDocument" 
 namespace="http://purl.org/dc/elements/1.1/">
   <field name="dctitle" xpath="//dc:title" type="Text" boost="1.4" /> 
   <field name="dccreator" xpath="//dc:creator" type="keyword" boost="1.0" /> 
 </xmlIndexerProperties>
 4- It is possible to define a default namespace that will be applied when the 
 parser 
 doesn't find any namespace in the document or when the namespace found in the 
 XML document doesn't match the namespace defined in the 
 xmlIndexerProperties. 
 Example :
 <xmlIndexerProperties type="filePerDocument" namespace="default">
   <field name="xmlcontent" xpath="//*" type="Unstored" boost="1.0" /> 
 </xmlIndexerProperties>
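
 To show what the //dc:title mapping above evaluates to, here is a sketch 
 using only the JDK's javax.xml.xpath API. It is not the plugin's code; 
 DcTitleExtractor and extractTitle are hypothetical names, and binding the 
 dc prefix by hand is an assumption about how such a mapping would be 
 resolved.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical sketch: evaluate //dc:title against a namespace-aware DOM.
final class DcTitleExtractor {
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    static String extractTitle(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for prefixed XPath steps
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String prefix) {
                    return "dc".equals(prefix) ? DC_NS : XMLConstants.NULL_NS_URI;
                }
                public String getPrefix(String uri) {
                    return DC_NS.equals(uri) ? "dc" : null;
                }
                public Iterator<String> getPrefixes(String uri) {
                    return DC_NS.equals(uri)
                            ? Collections.singletonList("dc").iterator()
                            : Collections.<String>emptyIterator();
                }
            });
            // Returns the text of the first match, or "" if nothing matches.
            return xpath.evaluate("//dc:title", doc);
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate XPath", e);
        }
    }
}
```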



[jira] [Closed] (NUTCH-310) Review Log Levels

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-310.
---


Bulk close of resolved issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 Review Log Levels
 -

 Key: NUTCH-310
 URL: https://issues.apache.org/jira/browse/NUTCH-310
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.8
Reporter: Jerome Charron
Assignee: Chris A. Mattmann
Priority: Minor
 Fix For: 2.0


 Review of log content and log levels (see Commons Logging Best Practices: 
 http://jakarta.apache.org/commons/logging/guide.html#Message_Priorities_Levels)



[jira] [Closed] (NUTCH-659) Help! No urls fetched for internal repository website

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-659.
---


Bulk close of resolved issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 Help! No urls fetched for internal repository website
 -

 Key: NUTCH-659
 URL: https://issues.apache.org/jira/browse/NUTCH-659
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.9.0
 Environment: nutch 0.9, TOMCAT6.0.18, JAVA 1.6.0_10, CentOS 5.2
Reporter: Bryan
Priority: Critical

 I am new to Nutch, and have set up Nutch to search my internal company 
 websites. The version is nutch-2008-11-02_04-01-26.tar.
  
 My internal company websites include several HTTP websites. 
 Another one is an SVN repository HTTPS website in XML structure, using <dir> 
 and <file> tags.
  
 The search in the HTTP websites is good. 
 HTTPS itself is OK: we have some links in those HTTP websites which point to 
 Word files under the SVN website, and they can be indexed.
  
 But Nutch does not search my SVN website. If I only search the SVN 
 website, it is always: 0 urls fetched.
  
 My nutch-site.xml is as follows:
 <property>
   <name>plugin.includes</name>
   <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|msexcel|msword|mspowerpoint|pdf|zip|swf|rss)|index-(basic|anchor)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  
 # skip file:, ftp:,  mailto: urls
 -^(ftp|mailto):
  
 # accept hosts in MY.DOMAIN.NAME
 +^http://([a-z0-9]*\.)*smartlabs.com.au/
  
 Any help would be much appreciated. Thanks in advance.
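
 One thing that can be checked directly from the filter rule quoted above: 
 the accept pattern starts with ^http://, so it does not match URLs that 
 begin with https://. Whether this is the cause of the problem is not 
 confirmed by the report. The sketch below only demonstrates the regex 
 behaviour; the host name svn.smartlabs.com.au and the https? variant are 
 assumptions for illustration.

```java
import java.util.regex.Pattern;

// Demonstrates how the quoted accept rule treats http and https URLs.
final class AcceptRuleCheck {
    // Pattern from the report, without the leading '+' of the filter syntax.
    static final Pattern HTTP_ONLY =
            Pattern.compile("^http://([a-z0-9]*\\.)*smartlabs.com.au/");
    // Assumed variant that also accepts https.
    static final Pattern HTTP_OR_HTTPS =
            Pattern.compile("^https?://([a-z0-9]*\\.)*smartlabs.com.au/");

    static boolean accepts(Pattern rule, String url) {
        return rule.matcher(url).find();
    }
}
```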



[jira] [Closed] (NUTCH-774) Retry interval in crawl date is set to 0

2011-04-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-774.
---


Bulk close of resolved issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

 Retry interval in crawl date is set to 0
 

 Key: NUTCH-774
 URL: https://issues.apache.org/jira/browse/NUTCH-774
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Reinhard Schwab
Assignee: Chris A. Mattmann
 Fix For: 1.2, 2.0

 Attachments: NUTCH-774.patch, NUTCH-774_2.patch


 When I fetch and parse a feed with the feed plugin,
 http://www.wachauclimbing.net/home/impressum-disclaimer/feed/
 another crawl date is generated
 http://www.wachauclimbing.net/home/impressum-disclaimer/comment-page-1/
 after fetching a second round
 the dump in the crawl db still shows a retry interval with value 0.
 http://www.wachauclimbing.net/home/impressum-disclaimer/comment-page-1/ 
 Version: 7
 Status: 2 (db_fetched)
 Fetch time: Wed Dec 02 12:48:22 CET 2009
 Modified time: Thu Jan 01 01:00:00 CET 1970
 Retries since fetch: 0
 Retry interval: 0 seconds (0 days)
 Score: 1.084
 Signature: db9ab2193924cd2d0b53113a500ca604
 Metadata: _pst_: success(1), lastModified=0
 A check should be done in DefaultFetchSchedule (or AbstractFetchSchedule) in 
 the method setFetchSchedule.
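
 The suggested check could be as small as falling back to a default when the 
 stored interval is not positive. This is a sketch, not the attached patch; 
 RetryIntervalGuard and effectiveInterval are hypothetical names, and where 
 the default value comes from is left to the caller.

```java
// Hypothetical guard against a zero (or negative) retry interval.
final class RetryIntervalGuard {

    /** Returns the stored interval if positive, otherwise the supplied default. */
    static int effectiveInterval(int storedSeconds, int defaultSeconds) {
        return storedSeconds > 0 ? storedSeconds : defaultSeconds;
    }
}
```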


