Re: how are filetypes mapped in Nutch

Ernesto De Santis Wed, 20 Sep 2006 06:18:30 -0700

Hi Alex

I don't know, but I'm as interested in this point as you are.
I'm downloading the Nutch source code to find out by debugging it.


If you are researching this issue, we can share results.

I googled it and found nothing.

Bye,
Ernesto.



Alex Quezada wrote:

I've been trying to parse files ending in ps.gz or pdf.gz. I would expect parse-zip to handle them first and then, based on the new filetype, pass them on to parse-pdf. However, from looking at the parse-zip source, it seems that it only attempts to extract text from the zipped file directly.
But what's really strange is that some files go (at least according to the hadoop log) to parse-zip, and others go directly to parse-pdf. Does anyone know where the filetype matching code is? I'm wondering if the regex has a bug and sometimes matches on the first part of the filetype (i.e. ps instead of gz for 'ps.gz').
Thanks,

Alex
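
To make the suspected regex bug concrete, here is a minimal sketch of how an unanchored extension pattern could pick the first part of a double extension. Everything in it is an assumption for illustration: the extension table, the function names, and the patterns are hypothetical and are not taken from the Nutch source, so it shows the kind of bug being guessed at, not how Nutch actually maps filetypes.

```python
import re

# Hypothetical extension-to-plugin table, for illustration only.
# It is NOT Nutch's real configuration.
PLUGIN_BY_EXTENSION = {"pdf": "parse-pdf", "gz": "parse-zip"}


def first_extension_match(filename):
    """Unanchored pattern: the leftmost known extension wins."""
    match = re.search(r"\.(pdf|gz)", filename)
    return PLUGIN_BY_EXTENSION[match.group(1)] if match else None


def last_extension_match(filename):
    """Anchored pattern: only the final extension is considered."""
    match = re.search(r"\.(pdf|gz)$", filename)
    return PLUGIN_BY_EXTENSION[match.group(1)] if match else None


if __name__ == "__main__":
    for name in ("report.pdf.gz", "report.pdf"):
        print(name, first_extension_match(name), last_extension_match(name))
```

With a name such as report.pdf.gz the two variants disagree, which is the behaviour to look for when stepping through the real matching code in a debugger.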


        
        
                
