On Sun, 07 Nov 2010 18:04:36 -0500 Andrew Pitonyak <and...@pitonyak.org> wrote: > > On Fri, 5 Nov 2010 21:24:08 -0400, Scott Deerwester > <scott.deerwes...@gmail.com> wrote: >> The following code (in Python): >> >> for r in range(dataRange.StartRow, dataRange.EndRow): >> for c in range(dataRange.StartColumn, dataRange.EndColumn): >> cell = sheet.getCellByPosition(c,r) >> >> >> takes nearly two hours to run on a reasonably fast workstation for a >> spreadsheet with 32 columns and ~27,000 rows (about .2 seconds per row). By >> comparison, opening the file takes about 8 seconds and saving the file to a >> CSV (which is functionally equivalent to the above) takes a few seconds. >> The >> original file was written in Excel as an XLS (not XLSX). This seems >> impossibly slow. Am I misusing the API somehow? What can I do to speed it >> up? > > > First, I usually determine which cells are used. Depending on the > operations, this may be a gross test such as simply finding the largest > used block. > > Next, if I must access the data in the cells, I obtain the data by getting > the entire range and then calling getDataArray() (or getData, depending on > my need / usage). > > I have a section on timing in AndrewMacor.odt where I search a Calc > document using different methods.
Thanks for the response, Andrew. I'm having trouble understanding how to use cursors, cell ranges and cell range data. What I want to do is: - Create an aggregate that has a particular (rectangular) data range that is a subset of the used cells. - Call an efficient accessor (getDataArray sounds promising) to go over the cells and write them out to a tab-separated file, with some processing for date formats first. I don't think it exists, but what I'd like to do would be: cellRange = sheet.createRangeFromPosition(startRow, startCol, endRow, endCol) for rowObj in cellRange.getDataArray(): for colNum in range(len(rowObj)): value = rowObj[colNum] if needSpecialProcessing(startCol + colNum): doSpecialProcessing(value) else: print value Barring that, I could do something like: cellRange = sheet.getDataArray() for rowNum in range(len(cellRange)): if rowNum >= startRow and rowNum <= endRow: rowObj = cellRange[rowNum] for colNum in range(len(rowObj)): if colNum >= startCol and colNum <= endCol: value = rowObj[colNum] if needSpecialProcessing(colNum): doSpecialProcessing(value) else: print value but that ends up retrieving a lot more data than I need, and in case isn't very elegant. Could you give me a simple example of how you'd generate something that generates the data array, given a starting and ending row and column?