NullPointerException in ZipTextExtractor if no MIME type for zipped file
------------------------------------------------------------------------

                 Key: NUTCH-472
                 URL: https://issues.apache.org/jira/browse/NUTCH-472
             Project: Nutch
          Issue Type: Bug
          Components: indexer
    Affects Versions: 0.9.0
         Environment: Any
            Reporter: Antony Bowesman


extractText throws a NullPointerException at

          String contentType = MIME.getMimeType(fname).getName();

when a file inside the zip has no configured MIME type: getMimeType(fname) 
returns null, and the resulting exception aborts parsing of the whole zip.
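To illustrate the failure mode and the guard, here is a minimal, self-contained 
sketch. It does not use Nutch's classes: MimeLookupSketch, lookup and describe 
are hypothetical names, and the map is only a stand-in for a MIME registry that 
returns null for names it has no type for.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class MimeLookupSketch {

    // Hypothetical stand-in for a MIME registry (not the Nutch API).
    private static final Map<String, String> TYPES = new HashMap<>();
    static {
        TYPES.put("txt", "text/plain");
        TYPES.put("html", "text/html");
    }

    // Returns the type name for the file's extension, or null if none is configured.
    public static String lookup(String fname) {
        int i = fname.lastIndexOf('.');
        if (i == -1) {
            return null;
        }
        return TYPES.get(fname.substring(i + 1).toLowerCase(Locale.ROOT));
    }

    // Checks the lookup result before using it, instead of chaining a call on it.
    public static String describe(String fname) {
        String type = lookup(fname);
        if (type == null) {
            return fname; // no type configured: fall back to the bare name
        }
        return fname + " [" + type + "]";
    }

    public static void main(String[] args) {
        System.out.println(describe("readme.txt")); // prints "readme.txt [text/plain]"
        System.out.println(describe("data.xyz"));   // prints "data.xyz"
    }
}
```

Chaining a method call directly on lookup("data.xyz") would throw the same kind 
of NullPointerException as the line quoted above.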

The code should check for a null MimeType before calling getName():

  public String extractText(InputStream input, String url, List outLinksList)
      throws IOException {
    String resultText = "";
    byte temp;
    
    ZipInputStream zin = new ZipInputStream(input);
    
    ZipEntry entry;
    
    while ((entry = zin.getNextEntry()) != null) {
      
      if (!entry.isDirectory()) {
        int size = (int) entry.getSize();
        byte[] b = new byte[size];
        for(int x = 0; x < size; x++) {
          int err = zin.read();
          if(err != -1) {
            b[x] = (byte)err;
          }
        }
        String newurl = url + "/";
        String fname = entry.getName();
        newurl += fname;
        URL aURL = new URL(newurl);
        String base = aURL.toString();
        int i = fname.lastIndexOf('.');
        if (i != -1) {
          // Trying to resolve the Mime-Type
          MimeType mt = MIME.getMimeType(fname);
          if (mt != null) {
            String contentType = mt.getName();
            try {
              Metadata metadata = new Metadata();
              metadata.set(Response.CONTENT_LENGTH,
                  Long.toString(entry.getSize()));
              metadata.set(Response.CONTENT_TYPE, contentType);
              Content content = new Content(newurl, base, b, contentType,
                  metadata, this.conf);
              Parse parse = new ParseUtil(this.conf).parse(content);
              ParseData theParseData = parse.getData();
              Outlink[] theOutlinks = theParseData.getOutlinks();
                
              for(int count = 0; count < theOutlinks.length; count++) {
                outLinksList.add(new Outlink(theOutlinks[count].getToUrl(),
                    theOutlinks[count].getAnchor(), this.conf));
              }
              
              resultText += entry.getName() + " " + parse.getText() + " ";
            } catch (ParseException e) {
              if (LOG.isInfoEnabled()) { 
                LOG.info("fetch okay, but can't parse " + fname + ", reason: "
                    + e.getMessage());
              }
            }
          } else {
            resultText += entry.getName();
          }
        }
      }
    }
    
    return resultText;
  }
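
A separate point, not part of the fix reported here: the method above sizes its 
buffer from entry.getSize(), which java.util.zip documents as returning -1 when 
the size is not known. The following is a minimal sketch, under that assumption, 
of reading each entry through a fixed buffer instead. ZipEntryReadSketch, 
readEntries, sampleZip and demo are hypothetical names used only for 
illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipEntryReadSketch {

    // Reads every non-directory entry fully, without relying on getSize().
    public static Map<String, byte[]> readEntries(InputStream input) throws IOException {
        Map<String, byte[]> result = new LinkedHashMap<>();
        try (ZipInputStream zin = new ZipInputStream(input)) {
            byte[] buf = new byte[4096];
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                int n;
                // read() returns -1 once the current entry is exhausted
                while ((n = zin.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                result.put(entry.getName(), out.toByteArray());
            }
        }
        return result;
    }

    // Builds a small in-memory zip so the sketch can run on its own.
    public static byte[] sampleZip() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(bytes)) {
            zout.putNextEntry(new ZipEntry("a.txt"));
            zout.write("hello".getBytes(StandardCharsets.UTF_8));
            zout.closeEntry();
        }
        return bytes.toByteArray();
    }

    // Round-trips the sample zip and returns the text of its single entry.
    public static String demo() {
        try {
            Map<String, byte[]> entries =
                readEntries(new ByteArrayInputStream(sampleZip()));
            return new String(entries.get("a.txt"), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "hello"
    }
}
```

Reading in chunks also avoids the byte-at-a-time loop, at the cost of holding 
each entry in a growable buffer.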


-- 
This message is automatically generated by JIRA.

