One simple approach would be to use POI's text extraction functionality and
compare only the extracted text of the two workbooks. Maybe that already
provides the "is equal" check you want?

Dominik

On Aug 1, 2016 6:35 PM, "Javen O'Neal" <[email protected]> wrote:

Within your !hash1.equals(hash2) code, you could save the byte streams to
disk, extract the zip files, and use a diff utility to figure out why
they're different.
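
If you would rather stay inside the test than go to disk, here is a rough sketch of the same idea using only java.util.zip: read both byte arrays as zip archives and compare them entry by entry, so the entry timestamps never enter the comparison. The class and method names (ZipContentDiff, readEntries, differingEntries, singleEntryZip) are made up for illustration and are not part of POI.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not a POI class.
final class ZipContentDiff {

    /** Reads every file entry into a name -> bytes map; entry metadata (dates etc.) is dropped. */
    static Map<String, byte[]> readEntries(byte[] zipBytes) {
        Map<String, byte[]> entries = new TreeMap<>();
        byte[] buffer = new byte[8192];
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                ByteArrayOutputStream content = new ByteArrayOutputStream();
                int n;
                while ((n = in.read(buffer)) != -1) {
                    content.write(buffer, 0, n);
                }
                entries.put(entry.getName(), content.toByteArray());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return entries;
    }

    /** Names of entries that exist on only one side or whose uncompressed bytes differ. */
    static TreeSet<String> differingEntries(byte[] zipA, byte[] zipB) {
        Map<String, byte[]> a = readEntries(zipA);
        Map<String, byte[]> b = readEntries(zipB);
        TreeSet<String> names = new TreeSet<>(a.keySet());
        names.addAll(b.keySet());
        TreeSet<String> differing = new TreeSet<>();
        for (String name : names) {
            if (!Arrays.equals(a.get(name), b.get(name))) {
                differing.add(name);
            }
        }
        return differing;
    }

    /** Demo helper: builds a one-entry zip with an explicit modification time. */
    static byte[] singleEntryZip(String name, String content, long timeMillis) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            ZipEntry entry = new ZipEntry(name);
            entry.setTime(timeMillis);
            out.putNextEntry(entry);
            out.write(content.getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        // Same content, different entry timestamps.
        byte[] first = singleEntryZip("xl/workbook.xml", "<workbook/>", 1_000_000_000_000L);
        byte[] second = singleEntryZip("xl/workbook.xml", "<workbook/>", 1_400_000_000_000L);
        System.out.println("raw bytes equal: " + Arrays.equals(first, second));
        System.out.println("differing entries: " + differingEntries(first, second));
    }
}
```

The returned entry names tell you which parts of the package to look at. This only compares bytes per entry, so it will still report differences caused by attribute order inside the XML.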

We use a lot of hash maps, which have nondeterministic order. If those maps
are serialized out as <node attr1="value1" attr2="value2"/>, then attrs
could be swapped. Some values are saved as <node
attr1="key1=value1;key2=value2" />, which would also have those problems.
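
For the second case, a minimal sketch of how such an attribute value could be normalized before comparing. It assumes the value really is a plain ";"-separated list of "key=value" pairs with no escaping, as in the example above; MapAttributeCanonicalizer and canonicalize are hypothetical names, not POI API.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not a POI class.
final class MapAttributeCanonicalizer {

    /**
     * Parses "k1=v1;k2=v2" and re-emits the pairs sorted by key.
     * Assumes no escaping and that every pair contains '='; a pair
     * without '=' is treated as a key with an empty value.
     */
    static String canonicalize(String attributeValue) {
        Map<String, String> sorted = new TreeMap<>();
        for (String pair : attributeValue.split(";")) {
            if (pair.trim().isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String key = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            sorted.put(key.trim(), value.trim());
        }
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, String> entry : sorted.entrySet()) {
            if (out.length() > 0) {
                out.append(';');
            }
            out.append(entry.getKey()).append('=').append(entry.getValue());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String a = canonicalize("key1=value1;key2=value2");
        String b = canonicalize("key2=value2;key1=value1");
        System.out.println(a.equals(b));
    }
}
```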

Not only do you need to diff your XML in a canonical form (order of XML
nodes and attributes, whitespace, self-closing vs. paired closing tags,
etc.), but any data in attributes (such as the serialized map above, or ids
that link portions of an XML document or multiple XML documents together)
also needs to be considered. For example, do you consider two workbooks
that print identically to be different if one workbook's StyleTable has an
extra unused style? How about two workbooks where two cell styles are
swapped and all references to those styles are updated to match? The answer
might be no for some purposes and yes for others.
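
As a starting point for the purely syntactic part (attribute order, whitespace between elements, self-closing vs. paired tags), here is a sketch that parses both documents with the JDK's DOM parser and compares them with Node.isEqualNode. XmlEquivalence and its methods are hypothetical names. It assumes whitespace-only text nodes are never significant, which is not true for content marked xml:space="preserve", and it does nothing about the semantic questions above (unused styles, renumbered ids, reordered sibling elements).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not a POI class.
final class XmlEquivalence {

    /** Parses XML text and removes whitespace-only text nodes. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            stripBlankText(doc.getDocumentElement());
            return doc;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    /** Recursively drops text nodes that contain only whitespace (an assumption, see above). */
    static void stripBlankText(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling();
            if (child.getNodeType() == Node.TEXT_NODE
                    && child.getNodeValue().trim().isEmpty()) {
                node.removeChild(child);
            } else {
                stripBlankText(child);
            }
            child = next;
        }
    }

    /**
     * True if the root elements are equal per DOM isEqualNode: attribute order
     * is ignored, but child element order, prefixes, and values must match.
     */
    static boolean equivalent(String xmlA, String xmlB) {
        return parse(xmlA).getDocumentElement()
                .isEqualNode(parse(xmlB).getDocumentElement());
    }

    public static void main(String[] args) {
        String a = "<node attr1=\"value1\" attr2=\"value2\"/>";
        String b = "<node attr2=\"value2\" attr1=\"value1\"></node>";
        System.out.println(equivalent(a, b));
    }
}
```

Whether this counts as "equal enough" still depends on which of the questions above you answer yes to.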

On Aug 1, 2016 06:10, "Nick Burch" <[email protected]> wrote:

> On Mon, 1 Aug 2016, [email protected] wrote:
>
>> we've been experiencing an indeterminism problem with POI's xlsx format,
>> when generating hash values with the following method in testng test
>> cases:
>>
>
> XLSX uses Zip files, which contain within them file dates. If you're
> comparing the outer zip file, you would expect an otherwise identical file
> to change every second as the datetimes move on
>
> Nick
>