Hi Alex
I don't know, but I'm as interested in this as you are.
I'm downloading the Nutch source code to find out by debugging it.
If you are researching this issue, we can share results.
I searched Google for it and found nothing.
Bye,
Ernesto.
Alex Quezada wrote:
I've been trying to parse files ending in ps.gz or pdf.gz. I would
expect that parse-zip would handle them first, and then based on the
new filetype, pass it on to parse-pdf. However, from looking at the
parse-zip source, it seems that it only attempts to extract text from
the zipped file directly.
But what's really strange is that for some files it goes (at least in
the hadoop log) to parse-zip, and other times directly to parse-pdf.
Does anyone know where the filetype matching code is? I'm wondering if the
regex has a bug and it sometimes matches on the first part of the file
type (i.e. ps instead of gz for 'ps.gz').
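The kind of bug suspected here can be illustrated with a minimal sketch. This is not Nutch's matching code: the rule tables, the `pick_parser` function, and the file names are hypothetical, and only the plugin names parse-pdf and parse-zip come from this thread. It only shows how an unanchored extension regex could pick a parser from the inner extension, while an anchored one would use the final extension.

```python
import re

# Hypothetical rule tables (NOT taken from Nutch): regex -> parser name.
# Unanchored patterns can match anywhere in the file name.
UNANCHORED_RULES = [(r"\.pdf", "parse-pdf"), (r"\.gz", "parse-zip")]
# Anchored patterns only match the final extension.
ANCHORED_RULES = [(r"\.pdf$", "parse-pdf"), (r"\.gz$", "parse-zip")]


def pick_parser(filename, rules):
    """Return the parser of the first rule whose regex matches, else None."""
    for pattern, parser in rules:
        if re.search(pattern, filename):
            return parser
    return None


if __name__ == "__main__":
    for name in ("report.pdf.gz", "paper.ps.gz"):
        print(name,
              pick_parser(name, UNANCHORED_RULES),
              pick_parser(name, ANCHORED_RULES))
```

With the unanchored rules, 'report.pdf.gz' is routed to parse-pdf because '.pdf' appears inside the name, while 'paper.ps.gz' falls through to parse-zip. If the real matching behaves like this, it could explain why some files go to one parser and some to the other.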
Thanks,
Alex