Re: Performance issue with saving/writing large xlsx files

martin . studer Mon, 08 Aug 2011 13:30:20 -0700

Hi Yegor,

as you suggested I was working on a patch. You can find it at:
Bug 51635 - [PATCH] Improving performance of XSSFSheet#write


Unit tests should pass just fine now. You can find a short description on
my changes in the description of the bug DB entry.

In case of any issues, please let me know.

Best regards and thanks for your support,
Martin


> Martin,
>
> The proposed fix works quite well and boosts the performance of
> XSSFWorkbook#write in almost ten times!
> Below is some statistics that I gathered generating a sheet with 20K
> rows and 10 columns.
>
> There are two tests, see the Java code in the end of this post.
>  (a) testOrdered that writes rows in ascending order. In this case
> most of extra time is spent in sheetData.getRowArray.
>  (b) testUnordered that writes rows in inverse order, i.e. from N to
> 0. In this case ensureRowOrdering does all the heavy lifting.
>
> worksheet size is 20000x10
>
> Time in ms spent in XSSFWorkbook#write:
>
> testOrdered:    24884
> testNotOrdered: 26983
>
> With your fix the time dropped in almost ten times:
>
> testOrdered:    2388
> testNotOrdered: 2619
>
> Unfortunately I can't apply the fix as-is because it breaks unit
> tests, in particular for XSSFSheet#shiftRows and XSSFSheet#removeRow.
> There may be more failures, I didn't fig further.
> If you can work on the patch and ensure that all tests pass that would
> be great. Otherwise it will wait till late August, until we release
> 3.8-beta4.
>
> My test code:
>
>     int numRows = 20000;
>     int numCols = 10;
>
>
>     public void testOrdered() throws Exception {
>         XSSFWorkbook wb = new XSSFWorkbook();
>         XSSFSheet sh = wb.createSheet();
>
>         ByteArrayOutputStream out;
>
>         System.out.println("testOrdered");
>         long t1 = System.currentTimeMillis();
>         for (int i = 0; i < numRows; i++) {
>             XSSFRow row = sh.createRow(i);
>             for (int j = 0; j < numCols; j++) {
>                 XSSFCell cell = row.createCell(j);
>                 cell.setCellValue(i*j);
>             }
>         }
>         long t2 = System.currentTimeMillis();
>         System.out.println("time to create: " + (t2 - t1));
>         out = new ByteArrayOutputStream();
>         wb.write(out);
>         long t3 = System.currentTimeMillis();
>
>         System.out.println("time to write: " + (t3 - t2));
>
>     }
>
>     public void testNotOrdered() throws Exception {
>         XSSFWorkbook wb = new XSSFWorkbook();
>         XSSFSheet sh = wb.createSheet();
>
>         ByteArrayOutputStream out;
>
>
>         System.out.println("testNotOrdered");
>         long t1 = System.currentTimeMillis();
>         for (int i = numRows - 1; i >= 0; i--) {
>             XSSFRow row = sh.createRow(i);
>             for (int j = 0; j < numCols; j++) {
>                 XSSFCell cell = row.createCell(j);
>                 cell.setCellValue(i * j);
>             }
>         }
>         long t2 = System.currentTimeMillis();
>         System.out.println("time to create: " + (t2 - t1));
>         out = new ByteArrayOutputStream();
>         wb.write(out);
>         long t3 = System.currentTimeMillis();
>
>         System.out.println("time to write: " + (t3 - t2));
>
>     }
>
> Cheers,
> Yegor
>
> On Fri, Aug 5, 2011 at 2:45 PM, Martin Studer
> <[email protected]> wrote:
>> Dear Developers,
>>
>>
>>
>> I've recently encountered a performance issue when writing large xlsx
>> files
>> (~ 10^5 rows and about 10 columns). Looking at the POI mailing list I
>> found
>> a discussion that was going in May/June 2010:
>>
>> http://apache-poi.1045710.n5.nabble.com/Performance-Question-with-CTSheetDat
>> aImpl-java-td2335065.html
>>
>>
>>
>> The thread describes some improvements that were implemented to check
>> whether the CTRows have to be reordered or not
>> (XSSFSheet#ensureRowOrdering). However, from what I've found, already
>> the
>> checks that have to be performed to see if rows have to be reordered are
>> quite expensive - especially the calls to  sheetData.getRowArray. I
>> understand that adapting the XMLBeans side of things is not straight
>> forward
>> / practical. However, the thread also describes an alternative approach
>> formulated by Yegor which is "to improve XSSFSheet#createRow so that the
>> array of CTRow beans is kept in a sorted state". To me this seems like a
>> very good idea. Looking at the current codebase it seems like this has
>> not
>> been implemented so far, probably due to other side-effects such a
>> change
>> would cause? Given that the sheet-internal _rows object already is an
>> ordered TreeMap, Yegor's alternative approach could probably be
>> implemented
>> like this:
>>
>>
>>
>> public XSSFRow createRow(int rownum) {
>>
>>        CTRow ctRow;
>>
>>        XSSFRow prev = _rows.get(rownum);
>>
>>        if(prev != null){
>>
>>            ctRow = prev.getCTRow();
>>
>>            ctRow.set(CTRow.Factory.newInstance());
>>
>>        } else {
>>
>>                    if(rownum > _rows.lastKey()) {
>>
>>                                // we can append the new row at the end
>>
>>                                ctRow =
>> worksheet.getSheetData().addNewRow();
>>
>>                    }
>>
>>                    else {
>>
>>                                // random access case ...
>>
>>                                // get number of rows where row index <
>> rownum
>>
>>                                // --> this tells us where our row should
>> go
>>
>>                                int idx = _rows.headMap(rownum).size();
>>
>>                                ctRow =
>> worksheet.getSheetData().insertNewRow(idx);
>>
>>                    }
>>
>>        }
>>
>>        XSSFRow r = new XSSFRow(ctRow, this);
>>
>>        r.setRowNum(rownum);
>>
>>        _rows.put(rownum, r);
>>
>>        return r;
>>
>>    }
>>
>>
>>
>>
>>
>> In case the row does not exist yet, we check whether the row to be
>> inserted
>> is the "highest" row (= has the largest row number). In this case we can
>> simply append the row to the CTSheetData via the addNewRow method.
>> Otherwise
>> we check how many rows there are which have a smaller index. Given the
>> invariant that the CTRows are ordered, this number tells us where we
>> have to
>> insert the new row (via the insertNewRow method).
>>
>>
>>
>> What is your opinion on that? Do you think this would work without
>> causing
>> other issues? Note, I have not been testing this as I'm still trying to
>> figure out how set up the checked out POI code with Eclipse (if somebody
>> has
>> some pointers on how to do that I would be very greatful).
>>
>>
>>
>> Have a good weekend,
>>
>> Martin
>>
>>
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: Performance issue with saving/writing large xlsx files

Reply via email to