https://issues.apache.org/bugzilla/show_bug.cgi?id=51524
Bug #: 51524
Summary: PapBinTable constructor is slow
Product: POI
Version: 3.8-dev
Platform: PC
Status: NEW
Severity: normal
Priority: P2
Component: HWPF
AssignedTo: [email protected]
ReportedBy: [email protected]
Classification: Unclassified
The current (r1147828) constructor of the PapBinTable class does something
like:
List<PAPX> newPapxs = new LinkedList<PAPX>();
foreach character in docText
    do something
    List<PAPX> papxs = new LinkedList<PAPX>();
    foreach paragraph in paragraphs
        do something with papxs
    do something with papxs and newPapxs
set this.paragraphs to newPapxs
The problem is that the running time grows quadratically with the document
size, because all paragraphs are visited for every character. For instance, I
have a document which has 341742 paragraphs, and docText at this point is
653186 characters long. I didn't even have the patience to wait until it
finished.
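To give a feeling for the scale, here is a rough back-of-the-envelope sketch.
It assumes the worst case, where every character of docText triggers a pass
over every paragraph; the real loop may do less work per character, so treat
the result as an upper bound. The class and method names are made up for this
illustration and are not part of POI.

```java
public class PapxCostEstimate {

    // Upper bound on inner-loop steps, ASSUMING each character causes a scan
    // over all paragraphs. multiplyExact throws instead of silently overflowing.
    static long worstCaseSteps(long characters, long paragraphs) {
        return Math.multiplyExact(characters, paragraphs);
    }

    public static void main(String[] args) {
        // Figures from the document described in this report.
        long steps = worstCaseSteps(653186L, 341742L);
        System.out.println(steps); // prints 223221090012
    }
}
```

That is on the order of 2 * 10^11 steps, each of which also touches a
LinkedList.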
In 3.7, this constructor was much simpler: this.paragraphs was not transformed.
The new transformation introduced a performance regression. We ran an
experiment in which we processed and indexed the content of some doc files. The
time rose from 9 to 54 minutes between 3.7 and 3.8.beta3.
The document I talked about comes from the govdocs dataset. It's public.
http://domex.nps.edu/corp/files/govdocs1/007/007488.doc
There is probably a good reason for this, but the performance regression is
significant and the previous version seems to have worked well enough. Maybe
this transformation could be disabled with some switch or a system property.
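A minimal sketch of what such a switch could look like, assuming a system
property is acceptable. The property name "hwpf.papx.skipRebuild" and the class
below are hypothetical and do not exist in POI; only Boolean.getBoolean is a
real JDK call.

```java
public class PapxRebuildSwitch {

    // HYPOTHETICAL property name, invented for this example.
    static final String SKIP_PROPERTY = "hwpf.papx.skipRebuild";

    // The transformation stays enabled unless the property is set to "true".
    static boolean rebuildEnabled() {
        return !Boolean.getBoolean(SKIP_PROPERTY);
    }

    public static void main(String[] args) {
        if (rebuildEnabled()) {
            System.out.println("would rebuild paragraphs (3.8 behaviour)");
        } else {
            System.out.println("would keep paragraphs as read (3.7 behaviour)");
        }
    }
}
```

With this, running the JVM with -Dhwpf.papx.skipRebuild=true would select the
old behaviour, and the default would remain unchanged.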