The performance improvement is significant for my simple "Laboratory" test.
See attached PNG file showing Yegor's code vs. Poi before the change.
Yegor's Code Changes Inplace:
Note that the AVG Time per Row does not change (O(1) per row leads to O(n) over
all)
File Name Rows Total AVG
Time Mills/
Millis Row
./20000_true.xls 20000 1964 0.1
./24000_true.xls 24000 2019 0.08
./28800_true.xls 28800 2624 0.09
./34560_true.xls 34560 3012 0.09
./41472_true.xls 41472 3409 0.08
./49766_true.xls 49766 4617 0.09
./59719_true.xls 59719 6162 0.1
./71662_true.xls 71662 5705 0.08
./85994_true.xls 85994 7487 0.09
./103192_true.xls 103192 9060 0.09
./123830_true.xls 123830 11875 0.1
./148596_true.xls 148596 13591 0.09
Original Poi (pre row order check)
Notice that the AVG Mills/Row increases linearly with rows
(O(n) which leads to O(n^2) as seen in the Total Time)
File Name Rows Total AVG
Time Mills/
Millis Row
./20000_true.xls 20000 4655 0.23
./24000_true.xls 24000 6398 0.27
./28800_true.xls 28800 7626 0.26
./34560_true.xls 34560 10178 0.29
./41472_true.xls 41472 17647 0.43
./49766_true.xls 49766 24645 0.5
./59719_true.xls 59719 38765 0.65
./71662_true.xls 71662 63009 0.88
./85994_true.xls 85994 98448 1.14
./103192_true.xls 103192 144313 1.4
./123830_true.xls 123830 217893 1.76
./148596_true.xls 148596 320606 2.16
Thanks for your effort's Yegor.
I will test this in the "wild" with real data soon.
Bryce Alcock * Performance Architect * SunGard *
Asset Arena * 377 E. Butterfield Road Suite 800, Lombard, IL 60148 *
Tel 630-986-3006 * Confidential Fax 630-515-1908 *
www.sungard.com/assetarena
-----Original Message-----
From: Yegor Kozlov [mailto:[email protected]]
Sent: Mon 5/24/2010 12:58 AM
To: POI Developers List
Cc: Alcock, Bryce
Subject: Re: Performance Question with CTSheetDataImpl.java
the purpose of calling CTSheetData#setRowArray is to ensure that the
array of CTRow beans in CTSheetData is ordered. The point is that we
don't always need this operation, in many cases rows are already ordered
and re-assigning them to CTSheetData is an unnecessary expensive operation.
Compare two use cases:
case 1: rows are written in increasing order
XSSFSheet sheet = workbook.createSheet();
sheet.createRow(1);
sheet.createRow(2);
sheet.createRow(3);
workbook.write(out); //no need to re-order rows before writing
case 2: row are created in random order
XSSFSheet sheet = workbook.createSheet();
sheet.createRow(3);
sheet.createRow(2);
sheet.createRow(1);
workbook.write(out); //need to re-order rows before writing
In the first case the call of CTSheetData#setRowArray is extra because
the state of CTSheetData match the logical model. In the second case we
do need to call CTSheetData#setRowArray to ensure that rows are written
in ascending order.
So, the performance of XSSFSheet#write depends on how rows were added:
if rows are added in strict ascending order then XSSFSheet#write is
fast. If the order of rows is random or rows were shifted then
XSSFSheet#write involves CTSheetData#setRowArray and becomes more
expensive.
I committed this improvement in r947542. Please try the latest build
from trunk.
Below are two my tests, the only difference is that in the first test
rows are written in ascending order and in the second test in descending.
public static void main(String[] args) throws Exception {
long t1 = System.currentTimeMillis();
XSSFWorkbook wb = new XSSFWorkbook();
XSSFSheet sh = wb.createSheet();
for (int i = 0; i < Short.MAX_VALUE; i++) {
XSSFRow row = sh.createRow(i);
for (int j = 0; j < 5; j++) {
XSSFCell cell = row.createCell(j);
cell.setCellValue(i*j);
}
}
long t2 = System.currentTimeMillis();
System.out.println("generate: " + (t2-t1) + " ms");
FileOutputStream out = new FileOutputStream(new
File("/temp/rows.xlsx"));
wb.write(out);
out.close();
long t3 = System.currentTimeMillis();
System.out.println("write: " + (t3-t2) + " ms");
}
generate: 5488 ms
write: 2507 ms
public static void main(String[] args) throws Exception {
long t1 = System.currentTimeMillis();
XSSFWorkbook wb = new XSSFWorkbook();
XSSFSheet sh = wb.createSheet();
for (int i = Short.MAX_VALUE; i > 0; i--) {
XSSFRow row = sh.createRow(i);
for (int j = 0; j < 5; j++) {
XSSFCell cell = row.createCell(j);
cell.setCellValue(i*j);
}
}
long t2 = System.currentTimeMillis();
System.out.println("generate: " + (t2-t1) + " ms");
FileOutputStream out = new FileOutputStream(new
File("/temp/rows.xlsx"));
wb.write(out); //will involve CTSheetData#setRowArray
out.close();
long t3 = System.currentTimeMillis();
System.out.println("write: " + (t3-t2) + " ms");
}
generate: 9291 ms
write: 94888 ms
The first test is almost 40X faster!
Regards,
Yegor
> Continuing to chip away at CTSheetDataImpl's performance issues.
> However, the problematic code is unlikely to be changed, I have a discussion
> going on the XMLBeans website, and as it turns out that code is most likely
> not going to change, or will not change easily.
>
>
> With that in mind, I started taking a close look at the POI side of things
> and very specifically the problematic code from a POI perspective.
>
>
> At approximately line 2368 of XSSFSheet.java
>
> the following line is the suspect line:
> I have modified the code to break 1 line into 3 the slow line is the last
> one.
>
> CTRow[] lctRow = new CTRow[rArray.size()];
> lctRow = rArray.toArray(lctRow);
> sheetData.setRowArray(lctRow);
>
>
>
> My Question is: in looking at XSSFSheet.java
> are there any assumption that we can make about the state of sheetData (A
> CTSheetData object)
> that would allow us to either overwrite the default setRowArray(lctRow)
> method,
> or side step this call all together by calling a method that is not generic,
> but instead much
> more targeted to the function and process that is currently going on.
>
>
>
>
>
>
>
>
> Here is the full call in context...Take from my modified version of
> XSSFSheet.java
>
>
> protected void write(OutputStream out) throws IOException {
>
> if(worksheet.getColsArray().length == 1) {
> CTCols col = worksheet.getColsArray(0);
> CTCol[] cols = col.getColArray();
> if(cols.length == 0) {
> worksheet.setColsArray(null);
> }
> }
>
> // Now re-generate our CTHyperlinks, if needed
> if(hyperlinks.size()> 0) {
> if(worksheet.getHyperlinks() == null) {
> worksheet.addNewHyperlinks();
> }
> CTHyperlink[] ctHls = new CTHyperlink[hyperlinks.size()];
> for(int i=0; i<ctHls.length; i++) {
> // If our sheet has hyperlinks, have them add
> // any relationships that they might need
> XSSFHyperlink hyperlink = hyperlinks.get(i);
> hyperlink.generateRelationIfNeeded(getPackagePart());
> // Now grab their underling object
> ctHls[i] = hyperlink.getCTHyperlink();
> }
> worksheet.getHyperlinks().setHyperlinkArray(ctHls);
> }
>
> CTSheetData sheetData = worksheet.getSheetData();
> ArrayList<CTRow> rArray = new ArrayList<CTRow>(rows.size());
> for(XSSFRow row : rows.values()){
> row.onDocumentWrite();
> rArray.add(row.getCTRow());
> }
> long startTimeSheetData = System.currentTimeMillis();
> CTRow[] lctRow = new CTRow[rArray.size()];
> lctRow = rArray.toArray(lctRow);
> sheetData.setRowArray(lctRow);
> System.out.println("Sheet Time : " +
> (System.currentTimeMillis() - startTimeSheetData));
>
>
> XmlOptions xmlOptions = new XmlOptions(DEFAULT_XML_OPTIONS);
> xmlOptions.setSaveSyntheticDocumentElement(new
> QName(CTWorksheet.type.getName().getNamespaceURI(), "worksheet"));
> Map<String, String> map = new HashMap<String, String>();
> map.put(STRelationshipId.type.getName().getNamespaceURI(), "r");
> xmlOptions.setSaveSuggestedPrefixes(map);
> long startTimeWorkSheetSave = System.currentTimeMillis();
>
> worksheet.save(out, xmlOptions);
> System.out.println("Worksheet save Time : " +
> (System.currentTimeMillis() - startTimeWorkSheetSave));
>
> }
>
>
>
> Best Regards
> Bryce Alcock
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>
>
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]