Meghna Kukreja wrote:
Hey,
I encountered the following problem while trying to crawl a site using
nutch-trunk. In the file regex-normalize.xml, the following regex is
used to remove session ids:
<pattern>([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\?|&amp;|#|$)</pattern>.
This pattern
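To make the behaviour of the quoted expression easier to check, here is a minimal Java sketch that applies it with `java.util.regex`. It rests on two assumptions that are not confirmed in this message: that the `amp;` alternative in the quoted pattern is the XML-escaped form of a plain `&`, and that the normalizer replaces each match with `$4` (the trailing delimiter). The class name `SessionIdPatternDemo` and the example URLs are hypothetical. The sketch does not try to reproduce the problem being reported here, since that description is cut off.

```java
import java.util.regex.Pattern;

public class SessionIdPatternDemo {
    // The quoted pattern, assuming the XML entity decodes to a plain '&'.
    static final Pattern SESSION_ID = Pattern.compile(
        "([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\\?|&|#|$)");

    // Assumption: the substitution is "$4", so only the delimiter that
    // ended the match (?, &, # or end of string) is kept.
    static String normalize(String url) {
        return SESSION_ID.matcher(url).replaceAll("$4");
    }

    public static void main(String[] args) {
        // Session id in the path part, followed by a query string.
        System.out.println(normalize("http://example.com/page;jsessionid=ABC123?x=1"));
        // prints http://example.com/page?x=1

        // Session id at the very end of the URL.
        System.out.println(normalize("http://example.com/page;jsessionid=ABC123"));
        // prints http://example.com/page

        // The parameter name is not anchored, so a name that merely ends
        // in "sid" is also matched from the "sid" onwards.
        System.out.println(normalize("http://example.com/view?newsid=42&lang=en"));
        // prints http://example.com/view?new&lang=en
    }
}
```

The last case is only an observation about the expression as written: nothing in it requires the parameter name to start at a delimiter, so any candidate replacement is worth testing against URLs of this kind as well.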
I believe Lucene has (in contrib/analyzers) a class called WordLoader or
something like that. Perhaps you can use that to load stopwords from a file
(like Solr does) and submit that as a patch?
Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
Url regex normalizer
Key: NUTCH-706
URL: https://issues.apache.org/jira/browse/NUTCH-706
Project: Nutch
Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Meghna Kukreja
Priority: Minor
[
https://issues.apache.org/jira/browse/NUTCH-706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12677460#action_12677460
]
Meghna Kukreja commented on NUTCH-706:
--
The pattern should be changed to:
Thanks Andrzej.
Here is the issue that I created in JIRA:
https://issues.apache.org/jira/browse/NUTCH-706. I have suggested an
alternative regular expression but would appreciate it if someone could
verify this as I am not very great with those :)
Thanks!
On Fri, Feb 27, 2009 at 12:10 PM, Andrzej
[
https://issues.apache.org/jira/browse/NUTCH-703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andrzej Bialecki closed NUTCH-703.
---
Resolution: Fixed
Fixed in rev. 748637.
Upgrade to Hadoop 0.19.1
[
https://issues.apache.org/jira/browse/NUTCH-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12677508#action_12677508
]
Sami Siren commented on NUTCH-705:
--
I think that the patch contains some lgpl code that we
Meghna Kukreja wrote:
Thanks Andrzej.
Here is the issue that I created in JIRA:
https://issues.apache.org/jira/browse/NUTCH-706. I have suggested an
alternative regular expression but would appreciate it if someone could
verify this as I am not very great with those :)
Perhaps you could write
[
https://issues.apache.org/jira/browse/NUTCH-703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12677646#action_12677646
]
Hudson commented on NUTCH-703:
--
Integrated in Nutch-trunk #738 (See
[
https://issues.apache.org/jira/browse/NUTCH-699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12677647#action_12677647
]
Hudson commented on NUTCH-699:
--
Integrated in Nutch-trunk #738 (See
Hi,
Is there going to be a delay of the 1.0 release? Today is almost Feb 28.
You said that 1.0 will come in Feb. I am customizing Nutch 0.9, and I am
wondering if I should wait a couple more days for the 1.0 release.
Thanks.
Andrzej Bialecki wrote:
Marko Bauhardt wrote:
Hi,
is there