Re: Text search on a PDF file using hadoop

Grant Ingersoll Wed, 30 Jul 2008 13:48:44 -0700

Well, PDF is a complicated beast. A tool like PDFBox is designed to help, but the price is sometimes performance (but heh, it beats not being able to do it). There are commercial converters available, but I can't say that they necessarily perform much better than PDFBox. I've broken a few of them with PDFs that PDFBox handles correctly.

I know of at least two projects that provide frameworks for dealing with PDFs (and other files like Word, etc.): Tika (a Lucene subproject) and Aperture (http://aperture.sourceforge.net).

Additionally, w/ PDFBox, there is no need to save the file back out if you just want the text, there are text extractors available that allow you to read in the file and then have the text in memory. This may help w/ your perf. problem, as it is likely that a good deal of time is spent on the I/O. See http://pdfbox.org/userguide/text_extraction.html for doing this.

You might also search the Lucene Java mail archives (http://lucene.markmail.org) for PDF extraction. This is something many Lucene users have tackled over time, so you may find more insight there.


Out of curiosity, what do you mean by "Hadoop Search"?

Cheers,
Grant

On Jul 30, 2008, at 7:25 AM, GaneshG wrote:

Thanks Joman, I tried PDFBox; it converts PDFs to text files, and hadoop search is working fine on these files. But performance is not good, since we first have to check whether each file is a PDF and then convert it. Also, it generates txt files with the same name as the original PDF, so if we already have index.txt and we try to convert index.pdf, that will be a problem for searches. We had better find some other way...
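
The two complaints above (checking the file type first, and index.pdf overwriting index.txt) can be handled cheaply before any conversion runs. The following is only a sketch in plain Java: the class and method names are hypothetical, it is not part of PDFBox or Hadoop, and it assumes the PDF header ("%PDF-") sits at the very start of the file, which is the usual case but not something every lenient reader requires.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the names are illustrative, not from PDFBox or Hadoop.
final class PdfInputs {
    // PDF files normally begin with the ASCII marker "%PDF-".
    private static final byte[] PDF_MAGIC =
            "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Returns true if the first bytes of a file match the PDF marker.
    // Assumption: the marker is at offset 0 (no leading junk bytes).
    static boolean looksLikePdf(byte[] head) {
        if (head == null || head.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (head[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    // Builds the extracted-text file name by appending ".txt" to the full
    // source name, so "index.pdf" becomes "index.pdf.txt" and cannot
    // collide with an existing "index.txt".
    static String textOutputName(String sourceName) {
        return sourceName + ".txt";
    }
}
```

Reading only the first five bytes of each input is enough for looksLikePdf, so the check costs far less than attempting a conversion on every file.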



Joman Chu-2 wrote:
I've been investigating this recently, and I came across Apache PDFBox
(http://incubator.apache.org/projects/pdfbox.html), which may
accomplish this in native Java. Try it out and get back to us on how
well it works, I'd be curious to know.

Joman Chu
AIM: ARcanUSNUMquam
IRC: irc.liquid-silver.net


On Wed, Jul 23, 2008 at 9:39 AM, Dhruba Borthakur <[EMAIL PROTECTED]>
wrote:
One option for you is to use a pdf-to-text converter (many of them are
available online) and then run map-reduce on the txt file.

-dhruba

On Wed, Jul 23, 2008 at 1:07 AM, GaneshG
<[EMAIL PROTECTED]> wrote:
Thanks Lohit, I am using only the default reader and I am very new to
hadoop.
This is my map method:
public void map(LongWritable key, Text value,
                OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        String val = tokenizer.nextToken();
        try {
            // Emit (line, file name) for each token that contains "the".
            // 'word' is presumably a Text field of the enclosing class (not shown).
            if (val != null && val.contains("the")) {
                word.set(line);
                FileSplit spl = (FileSplit) reporter.getInputSplit();
                output.collect(word, new Text(spl.getPath().getName()));
            }
        } catch (Exception e) {
            System.out.println(e);
        }
    }
}
I have a PDF file in my DFS input folder. Can you tell me what I have to
do to read PDF files?

Thanks
Ganesh.G
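
A side note on the map method above: because it tests each token with contains("the"), tokens such as "other" also match, and a line with several matching tokens is emitted once per token. If whole-word matching with one result per line is what is wanted (an assumption, the post does not say), the per-line check could look like the sketch below. It is plain Java with a hypothetical class name, independent of Hadoop, and it does not strip punctuation, so "the," would not match.

```java
import java.util.StringTokenizer;

// Hypothetical helper; not part of Hadoop.
final class LineMatcher {
    // Returns true if any whitespace-separated token equals the word exactly.
    // Stops at the first hit, so a caller would emit each line at most once.
    static boolean containsWord(String line, String word) {
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            if (tokenizer.nextToken().equals(word)) {
                return true;
            }
        }
        return false;
    }
}
```

Inside map, the loop would then reduce to a single `if (LineMatcher.containsWord(line, "the"))` around the existing output.collect call.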


lohit-2 wrote:
Can you provide more information? How are you passing your input? Are
you passing raw PDF files? If so, are you using your own record reader?
The default record reader won't read PDF files, and you won't get the
text out of them as is.
Thanks,
Lohit



----- Original Message ----
From: GaneshG <[EMAIL PROTECTED]>
To: [email protected]
Sent: Wednesday, July 23, 2008 1:51:52 AM
Subject: Text search on a PDF file using hadoop
When I search for text in a PDF file using hadoop, the results do not
come out properly. I tried to debug my program, and I could see that the
lines read from the PDF file are not formatted. Please help me resolve
this.
--
View this message in context:
http://www.nabble.com/Text-search-on-a-PDF-file-using-hadoop-tp18606475p18606475.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
