I've been looking into this recently and came across Apache PDFBox
(http://incubator.apache.org/projects/pdfbox.html), which may be able to
extract the text from a PDF in pure Java. Try it out and let us know how
well it works; I'd be curious to know.
Joman Chu
AIM: ARcanUSNUMquam
IRC: irc.liquid-silver.net
On Wed, Jul 23, 2008 at 9:39 AM, Dhruba Borthakur <[EMAIL PROTECTED]> wrote:
> One option for you is to use a PDF-to-text converter (many of them are
> available online) and then run map-reduce on the resulting text file.
>
> -dhruba
>
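To make the second step concrete: once a converter has produced plain
text, the search itself is an ordinary line scan. Here is a minimal sketch
in plain Java, outside Hadoop. The class and method names
(ConvertedTextSearch, search) and the sample file name are made up for
illustration, and it assumes the converter's output has already been read
into a list of lines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Hadoop or of any PDF converter.
final class ConvertedTextSearch {

    // Returns one "source:lineNumber: text" entry per line containing the term.
    static List<String> search(String sourceName, List<String> lines, String term) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            if (line.contains(term)) {
                hits.add(sourceName + ":" + (i + 1) + ": " + line);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Stand-in for the text a converter might produce from a PDF.
        List<String> lines = Arrays.asList("first page", "find the needle", "last page");
        for (String hit : search("report.txt", lines, "needle")) {
            System.out.println(hit);
        }
    }
}
```

In a real job the same test would sit inside the map method, with the
default line-based reader supplying each line of the converted file.
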
> On Wed, Jul 23, 2008 at 1:07 AM, GaneshG
> <[EMAIL PROTECTED]> wrote:
>>
>> Thanks Lohit. I am using only the default reader, and I am very new to Hadoop.
>> This is my map method:
>>
>> public void map(LongWritable key, Text value,
>>         OutputCollector<Text, Text> output, Reporter reporter)
>>         throws IOException {
>>     String line = value.toString();
>>     StringTokenizer tokenizer = new StringTokenizer(line);
>>     while (tokenizer.hasMoreTokens()) {
>>         String val = tokenizer.nextToken();
>>         try {
>>             // Emit (line, file name) for every token containing "the".
>>             if (val != null && val.contains("the")) {
>>                 word.set(line);
>>                 FileSplit spl = (FileSplit) reporter.getInputSplit();
>>                 output.collect(word, new Text(spl.getPath().getName()));
>>             }
>>         } catch (Exception e) {
>>             System.out.println(e);
>>         }
>>     }
>> }
>> }
>>
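One thing to watch in the map method above, separate from the PDF problem:
contains("the") is a substring test, and collect() is called once per
matching token, so a single line can be emitted several times. I'm only
guessing that whole-word matching is what you want; the sketch below
compares the two in plain Java. TokenMatchDemo and its method names are
made up for illustration.

```java
import java.util.StringTokenizer;

// Hypothetical demo class; it mirrors the tokenizing loop of the map method.
final class TokenMatchDemo {

    // Counts tokens that contain the term anywhere, as val.contains(term) does.
    static int substringMatches(String line, String term) {
        int count = 0;
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            if (tokenizer.nextToken().contains(term)) {
                count++;
            }
        }
        return count;
    }

    // Counts tokens that are exactly the term.
    static int wholeWordMatches(String line, String term) {
        int count = 0;
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            if (tokenizer.nextToken().equals(term)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String line = "gather the other theme";
        System.out.println("substring matches:  " + substringMatches(line, "the"));
        System.out.println("whole-word matches: " + wholeWordMatches(line, "the"));
    }
}
```

For the sample line, every one of the four tokens contains "the", but only
one token is the word itself. With the substring test, the map method
would collect that line four times.
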
>> I have a PDF file in my DFS input folder. Can you tell me what I have to do
>> to read PDF files?
>>
>> Thanks
>> Ganesh.G
>>
>>
>> lohit-2 wrote:
>>>
>>> Can you provide more information? How are you passing your input: are you
>>> passing raw PDF files? If so, are you using your own record reader?
>>> The default record reader won't read PDF files, and you won't get the text
>>> out of them as is.
>>> Thanks,
>>> Lohit
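A small illustration of why the raw bytes are not searchable as text. PDF
content streams are commonly stored compressed (the FlateDecode filter,
which is zlib/deflate), so the words of a page usually do not appear as
plain bytes in the file. The sketch below is not a PDF parser: it only
compresses a sample string with java.util.zip and checks whether a search
term is visible before and after decompression. CompressedStreamDemo and
its method names are made up for illustration, and the assumption that a
given PDF uses Flate compression is mine.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical demo class; it stands in for one compressed stream of a PDF.
final class CompressedStreamDemo {

    // Compresses the input in zlib format.
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompresses zlib data; stops early if the input is truncated.
    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break;
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid zlib data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    // Treats the bytes as single-byte characters, roughly what a text reader sees.
    static boolean containsTerm(byte[] data, String term) {
        return new String(data, StandardCharsets.ISO_8859_1).contains(term);
    }

    public static void main(String[] args) {
        byte[] plain = "search for the needle in the haystack, then the needle again"
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] packed = deflate(plain);
        System.out.println("term visible in compressed bytes: " + containsTerm(packed, "needle"));
        System.out.println("term visible after inflating:     " + containsTerm(inflate(packed), "needle"));
    }
}
```

A real PDF adds more layers on top of this (object structure, fonts,
encodings), which is why a dedicated extraction library or converter is
the practical route.
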
>>>
>>>
>>>
>>> ----- Original Message ----
>>> From: GaneshG <[EMAIL PROTECTED]>
>>> To: [EMAIL PROTECTED]
>>> Sent: Wednesday, July 23, 2008 1:51:52 AM
>>> Subject: Text search on a PDF file using hadoop
>>>
>>>
>>> When I search for text in a PDF file using Hadoop, the results do not come
>>> out properly. I tried to debug my program and could see that the lines read
>>> from the PDF file are not formatted. Please help me resolve this.
>>> --
>>> View this message in context:
>>> http://www.nabble.com/Text-search-on-a-PDF-file-using-hadoop-tp18606475p18606475.html
>>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>>>
>>>
>>
>
>