Re: Url regex normalizer

2009-02-27 Thread Andrzej Bialecki
Meghna Kukreja wrote: Hey, I encountered the following problem while trying to crawl a site using nutch-trunk. In the file regex-normalize.xml, the following regex is used to remove session ids: <pattern>([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\?|&amp;|#|$)</pattern>. This pattern
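
A minimal sketch for trying the quoted session-id pattern against sample URLs. It is not Nutch code: the function name strip_session_id and the example.com URLs are made up for illustration, the XML-escaped ampersand is written as a plain &, and the inline (?i) flags are rewritten as scoped (?i:...) groups because Python's re module rejects a global flag in the middle of a pattern. The replacement is assumed to be the fourth group (the terminator), so treat the results as an approximation of what the Java normalizer would do.

```python
import re

# Python adaptation of the pattern quoted above. Group numbering is kept the
# same as in the original: group 4 is the terminator (?, &, # or end of string).
SESSION_ID = re.compile(
    r"([;_]?((?i:l|j|bv_))?((?i:sid|phpsessid|sessionid))=.*?)(\?|&|#|$)"
)


def strip_session_id(url: str) -> str:
    """Remove anything the pattern treats as a session id, keeping the terminator.

    Assumption: the substitution is the fourth group, i.e. only the terminator
    that followed the session id is written back.
    """
    return SESSION_ID.sub(r"\4", url)


if __name__ == "__main__":
    # A typical servlet-style session id is removed.
    print(strip_session_id("http://example.com/page;jsessionid=ABC123?x=1"))
    # A parameter whose name merely ends in "sid" (case-insensitively) also matches.
    print(strip_session_id("http://example.com/news?newsId=42&lang=en"))
```

The second call shows why a pattern like this is worth covering with a few test URLs before it goes into regex-normalize.xml: the match can start in the middle of an unrelated parameter name.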

Re: NutchAnalysis.java STOP_WORDS not configurable?

2009-02-27 Thread Otis Gospodnetic
I believe Lucene has (in contrib/analyzers) a class called WordLoader or something like that. Perhaps you can use that to load stopwords from a file (like Solr does) and submit that as a patch? Thanks, Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message

[jira] Created: (NUTCH-706) Url regex normalizer

2009-02-27 Thread Meghna Kukreja (JIRA)
Url regex normalizer Key: NUTCH-706 URL: https://issues.apache.org/jira/browse/NUTCH-706 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Reporter: Meghna Kukreja Priority: Minor

[jira] Commented: (NUTCH-706) Url regex normalizer

2009-02-27 Thread Meghna Kukreja (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12677460#action_12677460 ] Meghna Kukreja commented on NUTCH-706: -- The pattern should be changed to:

Re: Url regex normalizer

2009-02-27 Thread Meghna Kukreja
Thanks Andrzej. Here is the issue that I created in JIRA: https://issues.apache.org/jira/browse/NUTCH-706. I have suggested an alternative regular expression but would appreciate it if someone could verify this as I am not very great with those :) Thanks! On Fri, Feb 27, 2009 at 12:10 PM, Andrzej

[jira] Closed: (NUTCH-703) Upgrade to Hadoop 0.19.1

2009-02-27 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-703. --- Resolution: Fixed Fixed in rev. 748637. Upgrade to Hadoop 0.19.1

[jira] Commented: (NUTCH-705) parse-rtf plugin

2009-02-27 Thread Sami Siren (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12677508#action_12677508 ] Sami Siren commented on NUTCH-705: -- I think that the patch contains some lgpl code that we

Re: Url regex normalizer

2009-02-27 Thread Sami Siren
Meghna Kukreja wrote: Thanks Andrzej. Here is the issue that I created in JIRA: https://issues.apache.org/jira/browse/NUTCH-706. I have suggested an alternative regular expression but would appreciate it if someone could verify this as I am not very great with those :) Perhaps you could write

[jira] Commented: (NUTCH-703) Upgrade to Hadoop 0.19.1

2009-02-27 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12677646#action_12677646 ] Hudson commented on NUTCH-703: -- Integrated in Nutch-trunk #738 (See

[jira] Commented: (NUTCH-699) Add an official solr schema for solr integration

2009-02-27 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12677647#action_12677647 ] Hudson commented on NUTCH-699: -- Integrated in Nutch-trunk #738 (See

Re: Release 1.0?

2009-02-27 Thread dealmaker
Hi, Is there going to be a delay of the 1.0 release? Today is almost Feb 28. You said that 1.0 will come in Feb. I am customizing Nutch 0.9, and I am wondering if I should wait a couple more days for the 1.0 release. Thanks. Andrzej Bialecki wrote: Marko Bauhardt wrote: Hi, is there