[api-dev] Extremely slow performance with getCellByPosition

Scott Deerwester Mon, 08 Nov 2010 09:52:30 -0800

On Sun, 07 Nov 2010 18:04:36 -0500 Andrew Pitonyak <and...@pitonyak.org> wrote:
>
> On Fri, 5 Nov 2010 21:24:08 -0400, Scott Deerwester 
> <scott.deerwes...@gmail.com> wrote:
>> The following code (in Python):
>>
>> for r in range(dataRange.StartRow, dataRange.EndRow):
>>     for c in range(dataRange.StartColumn, dataRange.EndColumn):
>>         cell = sheet.getCellByPosition(c,r)
>>
>>
>> takes nearly two hours to run on a reasonably fast workstation for a
>> spreadsheet with 32 columns and ~27,000 rows (about .2 seconds per row). By
>> comparison, opening the file takes about 8 seconds and saving the file to a
>> CSV (which is functionally equivalent to the above) takes a few seconds.
>> The
>> original file was written in Excel as an XLS (not XLSX). This seems
>> impossibly slow. Am I misusing the API somehow? What can I do to speed it
>> up?
>
>
> First, I usually determine which cells are used. Depending on the
> operations, this may be a gross test such as simply finding the largest
> used block.
>
> Next, if I must access the data in the cells, I obtain the data by getting
> the entire range and then calling getDataArray() (or getData, depending on
> my need / usage).
>
> I have a section on timing in AndrewMacor.odt where I search a Calc
> document using different methods.


Thanks for the response, Andrew. I'm having trouble understanding how
to use cursors, cell ranges and cell range data. What I want to do is:

- Create an aggregate that has a particular (rectangular) data range
that is a subset of the used cells.
- Call an efficient accessor (getDataArray sounds promising) to go
over the cells and write them out to a tab-separated file, with some
processing for date formats first.

I don't think it exists, but what I'd like to do would be:

  cellRange = sheet.createRangeFromPosition(startRow, startCol, endRow, endCol)

  for rowObj in cellRange.getDataArray():
    for colNum in range(len(rowObj)):
      value = rowObj[colNum]
      if needSpecialProcessing(startCol + colNum):
        doSpecialProcessing(value)
      else:
        print value

Barring that, I could do something like:

  cellRange = sheet.getDataArray()

  for rowNum in range(len(cellRange)):
    if rowNum >= startRow and rowNum <= endRow:
      rowObj = cellRange[rowNum]
      for colNum in range(len(rowObj)):
        if colNum >= startCol and colNum <= endCol:
          value = rowObj[colNum]
          if needSpecialProcessing(colNum):
            doSpecialProcessing(value)
          else:
            print value

but that ends up retrieving a lot more data than I need, and in case
isn't very elegant. Could you give me a simple example of how you'd
generate something that generates the data array, given a starting and
ending row and column?

[api-dev] Extremely slow performance with getCellByPosition

Reply via email to