Well, PDF is a complicated beast. A tool like PDFBox is designed to help, but the price is sometimes performance (but heh, it beats not being able to do it). There are commercial converters available, but I can't say that they necessarily perform much better than PDFBox. I've broken a few of them with PDFs that PDFBox handles correctly.

I know of at least two projects that provide frameworks for dealing with PDFs (and other files like Word, etc.): Tika (a Lucene subproject) and Aperture (http://aperture.sourceforge.net).

Additionally, with PDFBox there is no need to save the file back out if you just want the text: there are text extractors available that read in the file and leave you with the text in memory. This may help with your perf. problem, as it is likely that a good deal of time is spent on the I/O. See http://pdfbox.org/userguide/text_extraction.html for doing this.

You might also search the Lucene Java mail archives (http://lucene.markmail.org ) for PDF extraction. This is something many Lucene users have tackled over time, so you may find more insight there.

Out of curiosity, what do you mean by "Hadoop Search"?

Cheers,
Grant

On Jul 30, 2008, at 7:25 AM, GaneshG wrote:


Thanks Joman, I tried PDFBox; it converts PDFs to text files, and Hadoop search works fine on those files. Performance is not good, though, since we first have to check whether the file is a PDF and then convert it. It also generates txt files with the same name as the original PDF, so if we already have index.txt and we try to convert index.pdf, that will be a problem for searches. We had better find some other way...
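
For illustration only, here is a minimal sketch of the two steps described above: checking whether a file is a PDF, and choosing an output name that cannot collide with an existing index.txt. It uses only the Java standard library. The class and method names (PdfNaming, isPdf, textOutputName) are hypothetical, and the check assumes a well-formed PDF whose first bytes are the "%PDF-" signature; files with leading junk bytes would be missed.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox or Hadoop.
public class PdfNaming {
    // Assumption: a well-formed PDF starts with the ASCII bytes "%PDF-".
    private static final byte[] PDF_MAGIC =
            "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the given leading bytes begin with "%PDF-". */
    public static boolean isPdf(byte[] header) {
        if (header == null || header.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (header[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /**
     * Appends ".txt" to the full original name instead of replacing the
     * extension, so index.pdf maps to index.pdf.txt, not index.txt.
     */
    public static String textOutputName(String pdfFileName) {
        return pdfFileName + ".txt";
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(isPdf(sample));
        System.out.println(textOutputName("index.pdf"));
    }
}
```

Reading only the first five bytes keeps the type check cheap compared with the conversion itself, and keeping the original extension in the output name avoids overwriting a text file that already exists.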



Joman Chu-2 wrote:

I've been investigating this recently, and I came across Apache PDFBox
(http://incubator.apache.org/projects/pdfbox.html), which may
accomplish this in native Java. Try it out and get back to us on how
well it works, I'd be curious to know.

Joman Chu
AIM: ARcanUSNUMquam
IRC: irc.liquid-silver.net


On Wed, Jul 23, 2008 at 9:39 AM, Dhruba Borthakur <[EMAIL PROTECTED]>
wrote:
One option for you is to use a pdf-to-text converter (many of them are
available online) and then run map-reduce on the txt file.

-dhruba

On Wed, Jul 23, 2008 at 1:07 AM, GaneshG
<[EMAIL PROTECTED]> wrote:

Thanks Lohit, I am using only the default reader and I am very new to
Hadoop.
This is my map method:

// 'word' is presumably a Text field on the enclosing Mapper class (not shown).
public void map(LongWritable key, Text value,
                OutputCollector<Text, Text> output, Reporter reporter)
    throws IOException {
  String line = value.toString();
  StringTokenizer tokenizer = new StringTokenizer(line);
  while (tokenizer.hasMoreTokens()) {
    String val = tokenizer.nextToken();
    try {
      // Emit (line, file name) for every token that contains "the".
      if (val != null && val.contains("the")) {
        word.set(line);
        FileSplit spl = (FileSplit) reporter.getInputSplit();
        output.collect(word, new Text(spl.getPath().getName()));
      }
    } catch (Exception e) {
      System.out.println(e);
    }
  }
}
} // closes the enclosing class (opening not shown)
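
A side note on the map method above: contains("the") is a substring test, so tokens such as "other" also match, and the line is emitted once for each matching token. The following is a minimal sketch of a whole-word check that answers once per line, assuming that is the intended behaviour. WordMatch and lineContainsWord are hypothetical names, and tokens are split on whitespace only, so "the," with trailing punctuation would not match.

```java
import java.util.StringTokenizer;

// Hypothetical helper showing whole-word matching on one line of text.
public class WordMatch {
    /**
     * Returns true if any whitespace-separated token of the line equals
     * the given word, ignoring case. Returns on the first match, so a
     * caller would emit the line at most once.
     */
    public static boolean lineContainsWord(String line, String word) {
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            if (tokenizer.nextToken().equalsIgnoreCase(word)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(lineContainsWord("the quick brown fox", "the"));
        System.out.println(lineContainsWord("another day", "the"));
    }
}
```

Inside the mapper, the while loop could be replaced by a single call to such a check, followed by one output.collect for the line.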

I have a PDF file in my DFS input folder. Can you tell me what I have to
do to read PDF files?

Thanks
Ganesh.G


lohit-2 wrote:

Can you provide more information? How are you passing your input; are
you passing raw PDF files? If so, are you using your own record reader?
The default record reader won't read PDF files, and you won't get the
text out of them as is.
Thanks,
Lohit



----- Original Message ----
From: GaneshG <[EMAIL PROTECTED]>
To: [email protected]
Sent: Wednesday, July 23, 2008 1:51:52 AM
Subject: Text search on a PDF file using hadoop


When I search for text in a PDF file using Hadoop, the results do not
come out properly. I tried to debug my program, and I could see that the
lines read from the PDF file are not formatted. Please help me resolve
this.
--
View this message in context:
http://www.nabble.com/Text-search-on-a-PDF-file-using-hadoop-tp18606475p18606475.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


