[ https://issues.apache.org/jira/browse/NUTCH-473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sami Siren resolved NUTCH-473.
------------------------------

    Resolution: Duplicate

Duplicate of NUTCH-456

> ExcelExtractor performance bad due to String concatenation
> ----------------------------------------------------------
>
>                 Key: NUTCH-473
>                 URL: https://issues.apache.org/jira/browse/NUTCH-473
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>    Affects Versions: 0.9.0
>         Environment: Tested under Windows, Java 1.5 and 1.6
>            Reporter: Antony Bowesman
>
> Using the 0.9 version, ExcelExtractor was still running after 4 hours at 100%
> CPU, trying to extract the text from a 3MB Excel file containing 26 sheets,
> half with a matrix of approx 1100 rows x P columns and the others with approx
> 1000 rows x E columns.
> After changing ExcelExtractor to use StringBuffer, the same extraction process
> took 3 seconds under Java 1.5. Code changes are below. The example uses a 4K
> buffer per sheet. This was a completely arbitrary choice, but it keeps the
> number of StringBuffer expansions low for large files without using too much
> space for small files.
>
>   protected String extractText(InputStream input) throws Exception {
>
>     String resultText = "";
>     HSSFWorkbook wb = new HSSFWorkbook(input);
>     if (wb == null) {
>       return resultText;
>     }
>
>     HSSFSheet sheet;
>     HSSFRow row;
>     HSSFCell cell;
>     int sNum = 0;
>     int rNum = 0;
>     int cNum = 0;
>
>     sNum = wb.getNumberOfSheets();
>
>     // Allow 4K per sheet - seems a reasonable start
>     StringBuffer sb = new StringBuffer(4096 * sNum);
>     for (int i=0; i<sNum; i++) {
>       if ((sheet = wb.getSheetAt(i)) == null) {
>         continue;
>       }
>       rNum = sheet.getLastRowNum();
>       for (int j=0; j<=rNum; j++) {
>         if ((row = sheet.getRow(j)) == null){
>           continue;
>         }
>         cNum = row.getLastCellNum();
>
>         for (int k=0; k<cNum; k++) {
>           if ((cell = row.getCell((short) k)) != null) {
>             /*if(HSSFDateUtil.isCellDateFormatted(cell) == true) {
>                 resultText += cell.getDateCellValue().toString() + " ";
>               } else
>              */
>             if (cell.getCellType() == HSSFCell.CELL_TYPE_STRING) {
>               sb.append(cell.getStringCellValue());
>               sb.append(' ');
>               // resultText += cell.getStringCellValue() + " ";
>             } else if (cell.getCellType() == HSSFCell.CELL_TYPE_NUMERIC) {
>               Double d = new Double(cell.getNumericCellValue());
>               sb.append(d.toString());
>               sb.append(' ');
>               // resultText += d.toString() + " ";
>             }
>             /* else if(cell.getCellType() == HSSFCell.CELL_TYPE_FORMULA){
>                  resultText += cell.getCellFormula() + " ";
>                }
>              */
>           }
>         }
>       }
>     }
>     return sb.toString();
>   }
>

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
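
The slowdown reported in the issue above is typical of building a large string with repeated `+=`. Each concatenation copies everything accumulated so far, so the total work grows roughly quadratically with the output size, whereas appending to a buffer is amortized linear. The following is a minimal, self-contained sketch of that difference and not the Nutch code: the class `ConcatDemo`, its method names, and the sample cell values are hypothetical, and it uses `StringBuilder` (the unsynchronized counterpart of `StringBuffer`, available since Java 5) where the patch uses `StringBuffer`.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; not part of Nutch or POI.
public class ConcatDemo {

    // Repeated String concatenation: every += allocates a new String and
    // copies the whole accumulated text, so work grows quadratically.
    static String joinWithConcat(List<String> cells) {
        String result = "";
        for (String cell : cells) {
            result += cell + " ";
        }
        return result;
    }

    // Appending to a pre-sized builder: the backing array is only copied
    // when it has to grow, so work is amortized linear in the output size.
    static String joinWithBuilder(List<String> cells, int initialCapacity) {
        StringBuilder sb = new StringBuilder(initialCapacity);
        for (String cell : cells) {
            sb.append(cell).append(' ');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed sample data standing in for spreadsheet cell values.
        List<String> cells = Arrays.asList("alpha", "42.0", "beta");
        String a = joinWithConcat(cells);
        String b = joinWithBuilder(cells, 4096);
        // Both approaches produce the same text; only the cost differs.
        System.out.println(a.equals(b));
    }
}
```

The initial capacity plays the same role as the 4K-per-sheet guess in the patch: it only affects how often the backing array must grow, not the result.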