[jira] Created: (NUTCH-348) Generator is building fetch list using *lowest* scoring URLs

2006-08-16 Thread Chris Schneider (JIRA)
Generator is building fetch list using *lowest* scoring URLs Key: NUTCH-348 URL: http://issues.apache.org/jira/browse/NUTCH-348 Project: Nutch Issue Type: Bug

Tika update

2006-08-16 Thread Jukka Zitting
Hi, There was recently discussion on perhaps starting a new Lucene sub-project, named Tika, to create a general-purpose library from the parser components and other features in Nutch that might interest a wider audience. To keep things rolling we've created a temporary staging area for the

[jira] Created: (NUTCH-349) Port Nutch to use Hadoop Text instead of UTF8

2006-08-16 Thread Andrzej Bialecki (JIRA)
Port Nutch to use Hadoop Text instead of UTF8 Key: NUTCH-349 URL: http://issues.apache.org/jira/browse/NUTCH-349 Project: Nutch Issue Type: Improvement Affects Versions: 0.9.0

Thoughts on Parser design and dependencies

2006-08-16 Thread Jukka Zitting
Hi, I have some questions about the dependencies of the Parser interface, especially from the perspective of generalizing it to the potential Tika project. The current dependencies are: * Configurable - depends on the Hadoop configuration system * Pluggable - depends on the Nutch plugin

Webinterface ignores hidden language field

2006-08-16 Thread David Podunavac
Hi there, I have noticed that Nutch does not care which language is selected. When I do a search, Nutch ignores which request.parameter was sent; instead it always uses the browser's config. Does anyone know what has to be changed? I tried to change the search.jsp but it will not be applied in

[jira] Commented: (NUTCH-349) Port Nutch to use Hadoop Text instead of UTF8

2006-08-16 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-349?page=comments#action_12428399 ] Sami Siren commented on NUTCH-349: If anything at all should be done then I'd go for #2. There was also a total incompatibility from 0.7 to 0.8 and I didn't see

Re: Tika update

2006-08-16 Thread Chris Mattmann
Hi Jukka, Thanks for your email. Indeed, there was discussion on the Lucene PMC email list about the Tika project. It was decided by the powers that be to discuss it more on the Nutch mailing list before moving forward with any vote on making Tika a sub-project of Apache Lucene. With regard to

Re: Any plans to move to build Nutch using Maven?

2006-08-16 Thread Chris Mattmann
Hi Steven, On 8/16/06 7:36 AM, steven shingler [EMAIL PROTECTED] wrote: (This thread moved from the User List.) OK Lukas, let's open it up to the dev list! :) Particularly, does the group feel moving to Maven would be _a good thing_ ? +1 I suggested this (however did not make any

Re: Any plans to move to build Nutch using Maven?

2006-08-16 Thread William Surowiec
I am an outsider; I hope my comments do not cause grief. AFAIK, Maven is transitioning to version 2. The newer version is more attractive, but noticeably different from version 1 in implementation and, closer to home, in configuration. It might be wise to explore some of the comments from other,

Re: Any plans to move to build Nutch using Maven?

2006-08-16 Thread Lukas Vlcek
Hi, I have almost no experience with Maven subprojects but somehow I feel this could help us with Nutch plugins. Am I correct? In Maven we can always call Ant goals as well and Jelly is fun to use. With Maven one of the biggest benefits would be that Eclipse (or other IDE) classpath settings

Re: Any plans to move to build Nutch using Maven?

2006-08-16 Thread Sami Siren
Lukas Vlcek wrote: Hi, I have almost no experience with Maven subprojects but somehow I feel this could help us with Nutch plugins. Am I correct? In Maven we can always call Ant goals as well and Jelly is fun to use. With Maven one of the biggest benefits would be that Eclipse (or other IDE)

Re: Tika update

2006-08-16 Thread Sami Siren
Chris Mattmann wrote: However, the current Nutch software contains many value-added pieces of code that are monolithically packaged together. If the services and capabilities from the code were provided as separate, modular component libraries, such services and capabilities could benefit many

Re: Tika update

2006-08-16 Thread Jukka Zitting
Hi, On 8/16/06, Sami Siren [EMAIL PROTECTED] wrote: IMO to solve the main problem one does not need to set up another project, just refactor and repackage. I'd be happy either way, as long as I get a nice reusable library to use in Jackrabbit. :-) I think the key question on whether to

Nutch, samba and urls...

2006-08-16 Thread René Treffer
Hi, I've just written a protocol-smb plugin; it's really simple (code attached). It uses the jcifs lib and seems to work, but there is some stuff I'd like to discuss... Nutch is glued to URL, which works if you write a URLHandler. No problem so far, but you can't install a URLHandler

Re: Any plans to move to build Nutch using Maven?

2006-08-16 Thread Nicolas Lalevée
On Wednesday 16 August 2006 17:18, Sami Siren wrote: Lukas Vlcek wrote: Hi, I have almost no experience with Maven subprojects but somehow I feel this could help us with Nutch plugins. Am I correct? In Maven we can always call Ant goals as well and Jelly is fun to use. With Maven

Re: Any plans to move to build Nutch using Maven?

2006-08-16 Thread Lukas Vlcek
Hi, I have just noticed that there is some activity going on called Tika (http://code.google.com/p/tika/) and these guys are starting directly with Maven2 (am I right?). The more the Nutch/Lucene/Hadoop/[Tika]/[?] thing grows, the more sophisticated a project management tool will be needed, I think.

HTTP Accept Header seems to be missing

2006-08-16 Thread Michael Wechner
Hi, it seems to me that Nutch does not send an HTTP Accept header. Is that on purpose? I would have expected that Nutch tells the server which MIME types it accepts, that is, which ones it is able to parse and index, but maybe I misunderstand something. Thanks Michi -- Michael Wechner Wyona - Open

[jira] Commented: (NUTCH-349) Port Nutch to use Hadoop Text instead of UTF8

2006-08-16 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-349?page=comments#action_12428537 ] Stefan Groschupf commented on NUTCH-349: My vote goes to #2. Having a tool that needs to be started manually would be better than complicating the already

[jira] Commented: (NUTCH-233) wrong regular expression hang reduce process for ever

2006-08-16 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12428542 ] Stefan Groschupf commented on NUTCH-233: Hi Otis, yes, for a serious whole-web crawl I need to change this regex first. It only hangs with some random URLs

[jira] Updated: (NUTCH-348) Generator is building fetch list using *lowest* scoring URLs

2006-08-16 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-348?page=all ] Stefan Groschupf updated NUTCH-348: Attachment: sortPatchV1.patch What do people think about this kind of solution? Generator is building fetch list using *lowest* scoring URLs