Well, PDF is a complicated beast. A tool like PDFBox is designed to
help, but the price is sometimes performance (but heh, it beats not
being able to do it). There are commercial converters available, but
I can't say that they necessarily perform much better than PDFBox.
I've broken a few of them with PDFs that PDFBox handles correctly.
I know of at least two projects that provide frameworks for dealing
with PDFs (and other files like Word, etc.): Tika (a Lucene
subproject) and Aperture (http://aperture.sourceforge.net).
Additionally, with PDFBox there is no need to save the file back out
if you just want the text. There are text extractors available that
let you read in the file and keep the text in memory. This may help
with your performance problem, as it is likely that a good deal of
time is spent on I/O. See
http://pdfbox.org/userguide/text_extraction.html for how to do this.
You might also search the Lucene Java mail archives
(http://lucene.markmail.org) for PDF extraction. This is something
many Lucene users have tackled over time, so you may find more insight
there.
Out of curiosity, what do you mean by "Hadoop Search"?
Cheers,
Grant
On Jul 30, 2008, at 7:25 AM, GaneshG wrote:
Thanks Joman, I tried PDFBox; it converts PDFs to text files, and
Hadoop search works fine on those files. Performance is not good,
though, since we first have to check whether each file is a PDF and
only then convert it. It also generates .txt files with the same name
as the original PDF, so if we already have index.txt and we convert
index.pdf, the two collide, which is a problem for searches. We had
better find some other way...
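The two pain points above (deciding whether a file is a PDF, and the
index.txt / index.pdf name clash) can be sketched in plain Java. This
is only an illustration under stated assumptions, not part of PDFBox
or Hadoop: the class PdfInputs and the methods looksLikePdf and
textNameFor are hypothetical names, the check assumes a well-formed
PDF whose first bytes are "%PDF-", and the naming scheme (appending
".txt" to the full original name) is just one possible convention.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class; not part of PDFBox or Hadoop.
final class PdfInputs {
    // A well-formed PDF begins with the marker "%PDF-" followed by a version.
    private static final byte[] PDF_MAGIC =
            "%PDF-".getBytes(StandardCharsets.US_ASCII);

    private PdfInputs() {}

    // Returns true if the leading bytes of a file start with "%PDF-".
    // Only the first five bytes are inspected, so the caller does not
    // need to read or convert the whole file to make the decision.
    static boolean looksLikePdf(byte[] header) {
        if (header == null || header.length < PDF_MAGIC.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(header, PDF_MAGIC.length), PDF_MAGIC);
    }

    // Keeps the original extension in the output name, so that the text
    // extracted from index.pdf does not overwrite an existing index.txt.
    static String textNameFor(String originalName) {
        return originalName + ".txt";
    }

    public static void main(String[] args) {
        byte[] header = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikePdf(header));      // prints "true"
        System.out.println(textNameFor("index.pdf"));  // prints "index.pdf.txt"
    }
}
```

Checking a few header bytes is cheap compared with running a converter
on every input, and keeping ".pdf" in the derived name means the
original file can still be recovered from a search hit.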
Joman Chu-2 wrote:
I've been investigating this recently, and I came across Apache PDFBox
(http://incubator.apache.org/projects/pdfbox.html), which may
accomplish this in native Java. Try it out and get back to us on how
well it works; I'd be curious to know.
Joman Chu
AIM: ARcanUSNUMquam
IRC: irc.liquid-silver.net
On Wed, Jul 23, 2008 at 9:39 AM, Dhruba Borthakur <[EMAIL PROTECTED]>
wrote:
One option for you is to use a pdf-to-text converter (many of them are
available online) and then run map-reduce on the txt file.
-dhruba
On Wed, Jul 23, 2008 at 1:07 AM, GaneshG
<[EMAIL PROTECTED]> wrote:
Thanks Lohit, I am using only the default reader, and I am very new to
Hadoop. This is my map method:
public void map(LongWritable key, Text value,
        OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        String val = tokenizer.nextToken();
        try {
            if (val != null && val.contains("the")) {
                word.set(line);
                FileSplit spl = (FileSplit) reporter.getInputSplit();
                output.collect(word, new Text(spl.getPath().getName()));
            }
        } catch (Exception e) {
            System.out.println(e);
        }
    }
}
}
I have a PDF file in my DFS input folder. Can you tell me what I have
to do to read PDF files?
Thanks
Ganesh.G
lohit-2 wrote:
Can you provide more information? How are you passing your input? Are
you passing raw PDF files? If so, are you using your own record
reader? The default record reader won't read PDF files, and you won't
get the text out of them as is.
Thanks,
Lohit
----- Original Message ----
From: GaneshG <[EMAIL PROTECTED]>
To: [email protected]
Sent: Wednesday, July 23, 2008 1:51:52 AM
Subject: Text search on a PDF file using hadoop
When I search for text in a PDF file using Hadoop, the results do not
come out properly. I tried to debug my program, and I could see that
the lines read from the PDF file are not formatted. Please help me
resolve this.
--
View this message in context:
http://www.nabble.com/Text-search-on-a-PDF-file-using-hadoop-tp18606475p18606475.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.