Re: Parse-tika ignores too much data...

2010-07-08 Thread Andrzej Bialecki
On 2010-07-07 22:32, Ken Krugler wrote: Hi Julien, See https://issues.apache.org/jira/browse/TIKA-457 for a description of one of the cases found by Andrzej. There seems to be something very wrong with the way body is handled, we also saw cases where it was twice in the output. Don't know

Re: Classifying pages on Nutch: plugins?

2010-07-08 Thread Julien Nioche
Daniel, Your message is not relevant for this mailing list. If you have questions about the TC API use http://groups.google.com/group/digitalpebble instead. Thanks On 8 July 2010 01:56, dgimenes dran...@gmail.com wrote: Julien, I'm in Luan's project too. I'd like to know if you have

[jira] Updated: (NUTCH-844) Improve NutchConfiguration

2010-07-08 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-844: Attachment: conf.patch Improve NutchConfiguration

[jira] Commented: (NUTCH-843) Separate the build and runtime environments

2010-07-08 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886318#action_12886318 ] Andrzej Bialecki commented on NUTCH-843: runtime/local doesn't need Hadoop

[jira] Commented: (NUTCH-843) Separate the build and runtime environments

2010-07-08 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886323#action_12886323 ] Julien Nioche commented on NUTCH-843: OK - for some reason I thought we could use

[jira] Commented: (NUTCH-843) Separate the build and runtime environments

2010-07-08 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886330#action_12886330 ] Andrzej Bialecki commented on NUTCH-843: Pseudo-distributed (i.e. a real

[jira] Resolved: (NUTCH-845) Native hadoop libs not available through maven

2010-07-08 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-845. Fix Version/s: 2.0 Resolution: Fixed Committed in rev. 961778. Thanks for review!