[Nutch-dev] [jira] Commented: (NUTCH-472) NullPointerException in ZipTextExtractor if no MIME type for zipped file

Sami Siren (JIRA) Fri, 11 May 2007 22:29:52 -0700

    [ https://issues.apache.org/jira/browse/NUTCH-472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495229 ]


Sami Siren commented on NUTCH-472:
----------------------------------

> Not sure how to turn source code in description into a patch file, but the 
> fixed "extractText" method was included earlier.

You can follow the instructions for creating patches on the Nutch wiki:
http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer


> NullPointerException in ZipTextExtractor if no MIME type for zipped file
> ------------------------------------------------------------------------
>
>                 Key: NUTCH-472
>                 URL: https://issues.apache.org/jira/browse/NUTCH-472
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 0.9.0
>         Environment: Any
>            Reporter: Antony Bowesman
>
> extractText throws an NPE in
>           String contentType = MIME.getMimeType(fname).getName();
> if a file in the zip has no configured MIME type, which breaks the parsing
> of the whole zip.
> The code should be:
>   public String extractText(InputStream input, String url, List outLinksList)
>       throws IOException {
>     String resultText = "";
>     byte temp;
>     
>     ZipInputStream zin = new ZipInputStream(input);
>     
>     ZipEntry entry;
>     
>     while ((entry = zin.getNextEntry()) != null) {
>       
>       if (!entry.isDirectory()) {
>         int size = (int) entry.getSize();
>         byte[] b = new byte[size];
>         for(int x = 0; x < size; x++) {
>           int err = zin.read();
>           if(err != -1) {
>             b[x] = (byte)err;
>           }
>         }
>         String newurl = url + "/";
>         String fname = entry.getName();
>         newurl += fname;
>         URL aURL = new URL(newurl);
>         String base = aURL.toString();
>         int i = fname.lastIndexOf('.');
>         if (i != -1) {
>           // Trying to resolve the Mime-Type
>           MimeType mt = MIME.getMimeType(fname);
>           if (mt != null) {
>             String contentType = mt.getName();
>             try {
>               Metadata metadata = new Metadata();
>               metadata.set(Response.CONTENT_LENGTH,
>                   Long.toString(entry.getSize()));
>               metadata.set(Response.CONTENT_TYPE, contentType);
>               Content content = new Content(newurl, base, b, contentType,
>                   metadata, this.conf);
>               Parse parse = new ParseUtil(this.conf).parse(content);
>               ParseData theParseData = parse.getData();
>               Outlink[] theOutlinks = theParseData.getOutlinks();
>                 
>               for(int count = 0; count < theOutlinks.length; count++) {
>                 outLinksList.add(new Outlink(theOutlinks[count].getToUrl(),
>                     theOutlinks[count].getAnchor(), this.conf));
>               }
>               
>               resultText += entry.getName() + " " + parse.getText() + " ";
>             } catch (ParseException e) {
>               if (LOG.isInfoEnabled()) { 
>                 LOG.info("fetch okay, but can't parse " + fname + ", reason: "
>                     + e.getMessage());
>               }
>             }
>           } else {
>               resultText += entry.getName();
>           }
>         }
>       }
>     }
>     
>     return resultText;
>   }
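
For readers who want to try the null-guard idea outside of Nutch, here is a
minimal, self-contained sketch. It is not the Nutch code: the class
ZipEntrySketch, the methods lookupMimeType, readEntry and describeEntries, and
the extension-to-type map are all hypothetical stand-ins for the MIME registry
used above. The sketch also reads each entry until end of stream instead of
sizing a buffer from entry.getSize(), because getSize() returns -1 when the
size is not known.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical illustration only; none of these names exist in Nutch.
class ZipEntrySketch {

  // Stand-in for a MIME lookup: returns null when the extension is unknown,
  // which is the case the caller has to guard against.
  static String lookupMimeType(String fname, Map<String, String> typesByExtension) {
    int i = fname.lastIndexOf('.');
    if (i == -1) {
      return null;
    }
    return typesByExtension.get(fname.substring(i + 1).toLowerCase());
  }

  // Reads the current entry to its end without relying on entry.getSize().
  static byte[] readEntry(ZipInputStream zin) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[4096];
    int n;
    while ((n = zin.read(buf, 0, buf.length)) != -1) {
      out.write(buf, 0, n);
    }
    return out.toByteArray();
  }

  // Maps each file entry to a short description. Entries with no known type
  // are still reported instead of aborting the whole archive.
  static Map<String, String> describeEntries(InputStream input,
      Map<String, String> typesByExtension) throws IOException {
    Map<String, String> result = new LinkedHashMap<>();
    try (ZipInputStream zin = new ZipInputStream(input)) {
      ZipEntry entry;
      while ((entry = zin.getNextEntry()) != null) {
        if (entry.isDirectory()) {
          continue;
        }
        byte[] data = readEntry(zin);
        String type = lookupMimeType(entry.getName(), typesByExtension);
        if (type != null) {
          result.put(entry.getName(), type + " (" + data.length + " bytes)");
        } else {
          // Null guard: keep the entry name, skip type-specific handling.
          result.put(entry.getName(), "unknown type, name only");
        }
      }
    }
    return result;
  }
}
```

Calling describeEntries with a map that only knows "txt" leaves any other
entry marked as unknown while the remaining entries are still processed, which
is the behaviour the fix above aims for.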

