Hi Mohamed, RE: #1 you are definitely headed in the right direction, but I can't directly tell you if that's the "correct" number of files :)
RE: #2, go for it on the log4j issue.

Cheers,
Chris

-----Original Message-----
From: Mohamed Mustafa Rafik Khimani <[email protected]>
Date: Saturday, March 1, 2014 9:43 PM
To: Chris Mattmann <[email protected]>
Subject: Re: CSCI ASSIGNMENT QUESTION

>Hello professor Mattmann,
>
>Thank you for replying to my doubts.
>
>I realized there was a small mistake in the above code. I was updating
>the same pdf file count for every keyword that was matched for the same
>file, instead of updating the count only once for any of the keywords
>that matched a file.
>
>My output statistics were as follows:
>
>Keyword(s) used: UFO, flying disc, disc, saucer, extraterrestrial craft,
>flying saucer
>No of files processed: 2067
>No of files containing keyword(s): 1139
>
>No of occurrences of each keyword:
>----------------------------------
>UFO: 121
>flying disc: 6
>disc: 989
>saucer: 14
>extraterrestrial craft: 0
>flying saucer: 9
>
>Whereas, I think the correct output count should be as follows:
>
>Keyword(s) used: UFO, flying disc, disc, saucer, extraterrestrial craft,
>flying saucer
>No of files processed: 2067
>No of files containing keyword(s): 999
>
>No of occurrences of each keyword:
>----------------------------------
>UFO: 121
>flying disc: 6
>disc: 989
>saucer: 14
>extraterrestrial craft: 0
>flying saucer: 9
>
>Please let me know if my understanding is correct.
>
>I am yet to look at the Tika-93 issue, but I have a couple of doubts
>apart from that:
>
>1. In order to check if a keyword is present in the pdf file, I am using
>the "contains" method in the String class:
>
>    //Use the parser to parse each PDF file
>    parser.parse(is, handler, metadata, new ParseContext());
>
>    //Get the content of the pdf files as a string
>    String content = handler.toString();
>
>    boolean contains = false;
>
>    //For every keyword, we check to see if it is present in the file and
>    //update keyword_counts and num_fileswithkeywords accordingly
>    for(String keyword:keywords)
>    {
>        if(content.contains(keyword))
>        {
>
>I wanted to know if this was the correct way to do it, or maybe I am
>missing something here?
>
>2. I get a Log4j warning each time I run my program:
>
>log4j:WARN No appenders could be found for logger
>(org.apache.pdfbox.pdfparser.XrefTrailerResolver).
>log4j:WARN Please initialize the log4j system properly.
>
>I looked up on the net and found a solution for it, but I would need to
>include the log4j jar file.
>I wanted to ask you if I should go ahead with this, and also whether I
>need to include the log4j jar file at the time of submission. I
>understand the command to compile and run the program will change
>slightly and I will include that in the readme.txt file.
>
>I plan to resolve the above 2 points before I look at the "check OCR
>quality before proceeding" step.
>
>Thank you so much for your time.
>
>Sincerely,
>
>Mohamed Mustafa Khimani
>
>On Sat, Feb 22, 2014 at 5:34 PM, Mattmann, Chris A (3980)
><[email protected]> wrote:
>
>Hi Mohamed,
>
>Thank you for your question. Your code below looks like
>it's accomplishing the basics, and the requirements of
>assignment #1. BTW, I'm CC'ing [email protected].
>
>The (optional check OCR quality) refers to the fact that
>in Tika 1.5, we rely on PDF parsing code that doesn't always
>get the text chars correctly out of PDF files, so this "check
>OCR quality" step was suggesting that if you can figure out a
>way to check that you may want to. Or alternatively, you may
>want to check out TIKA-93 [1] and Grant's work there and help
>me test it out.
>
>Cheers,
>Chris
>
>[1] https://issues.apache.org/jira/browse/TIKA-93
>
>-----Original Message-----
>From: Mohamed Mustafa Rafik Khimani <[email protected]>
>Date: Saturday, February 22, 2014 5:23 PM
>To: Chris Mattmann <[email protected]>
>Subject: Re: CSCI ASSIGNMENT QUESTION
>
>>Hello Professor Mattmann,
>>
>>Thank you for your reply.
>>
>>Currently, I am doing the following:
>>
>>    InputStream is = new BufferedInputStream(new FileInputStream(f));
>>    Parser parser = new PDFParser();
>>    ContentHandler handler = new BodyContentHandler(-1);
>>    Metadata metadata = new Metadata();
>>
>>    //Use the parser to parse each PDF file
>>    parser.parse(is, handler, metadata, new ParseContext());
>>
>>    //Get the content of the pdf files as a string
>>    String content = handler.toString();
>>
>>    //For every keyword, we check to see if it is present in the file and
>>    //update keyword_counts and num_fileswithkeywords accordingly
>>    for(String keyword:keywords)
>>    {
>>        if(content.contains(keyword))
>>        {
>>            updatelog(keyword,f.getName());
>>            int count = keyword_counts.get(keyword);
>>            count++;
>>            keyword_counts.put(keyword, count);
>>            num_fileswithkeywords++;
>>        }
>>    }
>>
>>I do not understand what "(optional) check OCR quality before proceeding"
>>means. Could you guide me where I can look for this?
>>
>>The above code is processing all pdf files and printing the output as
>>needed. Could you please let me know if anything else is needed for the
>>assignment, except the optional OCR quality check?
>>
>>Sincerely,
>>
>>Mohamed Mustafa Khimani
>>
>>On Wed, Feb 19, 2014 at 7:33 PM, Chris Mattmann
>><[email protected]> wrote:
>>
>>Thanks for your question Mohamed, feel free to send these
>>types of questions to [email protected]. It would be a
>>great place to ask them and tell your classmates too.
>>
>>I'm copying the list on this message.
>>
>>(BTW you can then find the mail in Google and other
>>mail archives after that)
>>
>>Sometimes the MIME type is incorrectly detected, and
>>the best bet is to file a JIRA issue here in Tika:
>>
>>https://issues.apache.org/jira/browse/TIKA
>>
>>and then attach the sample PDF file for testing.
>>
>>If you have to preprocess a file in your specific
>>assignment in CS572, that's fine too; you can just
>>force it to call the PDF parser by calling it directly
>>from your program or Java code and then bypass that step.
>>
>>HTH!
>>
>>Cheers,
>>Chris
>>
>>------------------------
>>Chris Mattmann
>>[email protected]
>>
>>-----Original Message-----
>>From: Mohamed Mustafa Rafik Khimani <[email protected]>
>>Date: Wednesday, February 19, 2014 12:56 PM
>>To: Chris Mattmann <[email protected]>
>>Subject: CSCI ASSIGNMENT QUESTION
>>
>>>Hello Professor Mattmann,
>>>
>>>I have a doubt regarding the Tika assignment. I was trying to read one
>>>of the pdf files downloaded from the vault. I was unable to read the
>>>file using the Tika class and the parse method, which was returning
>>>null for each line.
>>>
>>>When I tried to use the detect method to check the MIME type of the
>>>file, it returned audio/mpeg.
>>>
>>>I tried using one of the known pdf files, which returned the correct
>>>MIME type and was parsed correctly.
>>>
>>>I wanted to confirm if I need to pre-process the file in any way before
>>>I can extract the contents, or if there might be a potential issue with
>>>the pdf files that I have downloaded, and I should maybe consider
>>>re-downloading them?
>>>
>>>I am following the Tika in Action book. I have read the first 4 chapters
>>>and will be reading the content extraction chapter next. I was trying a
>>>few things while reading the text, so thought of asking you if this is
>>>expected or if I am going wrong somewhere.
>>>
>>>Thank you for your time.
>>>
>>>Sincerely,
>>>
>>>Mohamed Mustafa Khimani
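
The once-per-file counting fix described in the March 1 message can be sketched roughly as follows. This is only an illustration under stated assumptions, not the assignment's reference solution: the class name KeywordStats and its methods are hypothetical, and the text passed to addFile is assumed to be whatever handler.toString() returns after the PDF has been parsed, as in the snippets quoted in the thread.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical helper (not part of Tika): tallies keyword hits over text
 * that has already been extracted from each file.
 */
class KeywordStats {
    private final List<String> keywords;
    // Number of files in which each keyword appears at least once.
    private final Map<String, Integer> filesPerKeyword = new LinkedHashMap<>();
    private int filesProcessed = 0;
    private int filesWithAnyKeyword = 0;

    KeywordStats(List<String> keywords) {
        this.keywords = keywords;
        for (String keyword : keywords) {
            filesPerKeyword.put(keyword, 0);
        }
    }

    /**
     * Records one file. The content argument is assumed to be the extracted
     * text of that file (handler.toString() in the thread's code).
     */
    void addFile(String content) {
        filesProcessed++;
        boolean matchedAny = false;
        for (String keyword : keywords) {
            // Same case-sensitive substring test as in the thread.
            if (content.contains(keyword)) {
                filesPerKeyword.merge(keyword, 1, Integer::sum);
                matchedAny = true;
            }
        }
        // Bump the file-level counter once per file, outside the keyword
        // loop, no matter how many keywords matched.
        if (matchedAny) {
            filesWithAnyKeyword++;
        }
    }

    int getFilesProcessed() { return filesProcessed; }

    int getFilesWithAnyKeyword() { return filesWithAnyKeyword; }

    Map<String, Integer> getFilesPerKeyword() { return filesPerKeyword; }
}
```

Calling addFile once per parsed PDF and then reading getFilesWithAnyKeyword() gives the number of files that matched at least one keyword, independent of how many keywords matched each file. Two caveats: the per-keyword numbers here count files containing the keyword, not total occurrences, and String.contains is case-sensitive and matches substrings, so "disc" also matches inside "flying disc".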
