Component fetching during parsing. (vertical crawling)

2010-07-20 Thread Ferdy
Hello, We are currently using a heavily modified version of Nutch. The main reason for this is that we fetch not only the URLs that the QueueFeeder submits, but also additional resources from URLs that are constructed during parsing. So for example let's say the QueueFeeder

Re: Component fetching during parsing. (vertical crawling)

2010-07-20 Thread Andrzej Bialecki
On 2010-07-20 14:30, Ferdy wrote: Hello, We are currently using a heavily modified version of Nutch. The main reason for this is that we fetch not only the URLs that the QueueFeeder submits, but also additional resources from URLs that are constructed during parsing. So for

Re: svn commit: r965815 - in /nutch/branches/nutchbase/src: java/org/apache/nutch/parse/ParseStatus.java java/org/apache/nutch/parse/ParseText.java test/org/apache/nutch/parse/TestParseText.java

2010-07-20 Thread Julien Nioche
Now that you mention upgrade solutions from 1.x to 2.0, I suggest that we open a JIRA to discuss this. IMHO we probably don't want to keep the 'old' code in src/java when we merge, but could have the code for the conversion utilities and the Nutch 1.x jars in the contrib/ directory

[jira] Resolved: (NUTCH-856) Use Tika for parsing feed

2010-07-20 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-856. Resolution: Fixed. Thanks Chris for reviewing and committing TIKA-466. I will mark the issue as

Re: svn commit: r965815 - in /nutch/branches/nutchbase/src: java/org/apache/nutch/parse/ParseStatus.java java/org/apache/nutch/parse/ParseText.java test/org/apache/nutch/parse/TestParseText.java

2010-07-20 Thread Julien Nioche
Thanks for your comments, Chris. However, we still need to address the issue raised by Dogacan, i.e. shall we provide tools to convert from 1.x structures to 2.0, and if so, how shall we organise it? Again - some things have been removed from NutchBase for the sake of clarity but since they are

[jira] Updated: (NUTCH-855) ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags and their subsequent indexing.

2010-07-20 Thread Scott Gonyea (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Scott Gonyea updated NUTCH-855. Fix Version/s: 2.0. Description: This plugin is designed to enhance the NUTCH-655 patch, by