> Are we talking about parsing RDF, or about storing parsed HTML
> text in RDF and converting it via XSLT to plain text?
> I may be misunderstanding something. I very much like the idea of a
> general RDF parser. Back in the day I played around with jena.sf.net.
> Parsing, yes. But replacing the Nutch sequence file and the concept of
> Writables with XML is, from my point of view, a bad idea.

Once more: please reread the proposal and my responses.
The proposal doesn't suggest replacing the way data is stored in Nutch.
It is just a proposal for a generic XML parser (as the title suggests).


> :-) I'm the last one to inhibit innovation, but I would love to see
> Nutch able to parse billions of pages.

Today, parsing billions of pages is not the only challenge for search
engines (look at Google, which no longer displays the number of indexed
pages). Parsing many content types and handling language technologies
(language-specific stemming, analysis, querying, summarization, ...) are
some of the other new challenges.
The "low level" challenges are important, but they must not hold back
"high level" processes.

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
