GitHub user antheque opened a pull request:
https://github.com/apache/any23/pull/19
Fix for a ThreadSafety issue in ItemPropValue
When multiple HTML documents are parsed concurrently in different threads,
the MicrodataParser sometimes throws unexpected exceptions or returns
corrupted values for date properties. I traced this to a static
SimpleDateFormat field in ItemPropValue. SimpleDateFormat is not thread-safe,
so a single instance must not be shared between threads without
synchronization.
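As a rough illustration of the ThreadLocal approach named in the first commit
message, the pattern might look like the sketch below. This is not the actual
ItemPropValue code: the class name, the method names, and the "yyyy-MM-dd"
pattern are assumptions chosen only for the example.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

// Hypothetical holder class; not part of Any23.
public class ThreadLocalDateFormat {

    // Unsafe variant, for contrast: one mutable formatter shared by all threads.
    // private static final SimpleDateFormat SHARED = new SimpleDateFormat("yyyy-MM-dd");

    // Each thread lazily creates its own formatter, so no mutable state is shared.
    // The date pattern is an assumption for this sketch.
    private static final ThreadLocal<SimpleDateFormat> FORMAT =
            ThreadLocal.withInitial(() -> new SimpleDateFormat("yyyy-MM-dd"));

    public static Date parse(String text) {
        try {
            return FORMAT.get().parse(text);
        } catch (ParseException e) {
            // Wrapped so that callers in this sketch need no checked-exception handling.
            throw new IllegalArgumentException("Not a valid date: " + text, e);
        }
    }

    public static String format(Date date) {
        return FORMAT.get().format(date);
    }
}
```

A per-thread instance avoids both the shared state and the lock contention
that synchronizing on a single formatter would introduce.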
I wrote a unit test that parses the same document in 10 concurrent threads,
100 times each, and reads the value of the "birthday" property after each
parse. In all 1,000 cases the value should be the same. On my machine the
test fails every time with various errors without the fix; with the fix, it
passes every time.
The test uses a cyclic barrier and ensures that only in-memory data is
used, so that I/O overhead and thread creation overhead do not interfere with
the actual processing under test.
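The shape of such a test can be sketched roughly as follows. This is not the
test from the pull request: the class, the extractBirthday stand-in (which
replaces the real document parsing), and the sample date are all hypothetical,
and only the thread and iteration counts are taken from the description above.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical harness; not the unit test from the pull request.
public class ConcurrentParseCheck {
    static final int THREADS = 10;
    static final int ITERATIONS = 100;

    private static final ThreadLocal<SimpleDateFormat> FORMAT =
            ThreadLocal.withInitial(() -> new SimpleDateFormat("yyyy-MM-dd"));

    // Stand-in for "parse the in-memory document and read the birthday property".
    static String extractBirthday(String raw) throws ParseException {
        SimpleDateFormat f = FORMAT.get();
        return f.format(f.parse(raw));
    }

    /** Returns the distinct values seen; a worker that fails bumps errors once and stops. */
    static Set<String> run(AtomicInteger errors) {
        Set<String> results = ConcurrentHashMap.newKeySet();
        CyclicBarrier barrier = new CyclicBarrier(THREADS);
        ExecutorService pool = Executors.newFixedThreadPool(THREADS);
        for (int t = 0; t < THREADS; t++) {
            pool.submit(() -> {
                try {
                    // Threads are already created here; all workers start together.
                    barrier.await();
                    for (int i = 0; i < ITERATIONS; i++) {
                        results.add(extractBirthday("1970-01-15"));
                    }
                } catch (Exception e) {
                    errors.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return results;
    }
}
```

With a thread-safe implementation, the returned set should contain exactly one
value and the error counter should stay at zero; a shared SimpleDateFormat in
place of the ThreadLocal would be expected to break one or both conditions
intermittently.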
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/antheque/any23 master
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/any23/pull/19.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19
----
commit 9afa87db7efaad706e64e76a82e6a53f657a817f
Author: Antoni Mylka <[email protected]>
Date: 2015-10-29T14:11:26Z
Replaced the static SimpleDateFormat field in ItemPropValue with a
ThreadLocal. The previous solution would yield broken results when
multiple documents were parsed concurrently. Added a unit test that
failed every time on my machine with the old version and succeeds every
time with the new version.
commit a8f1bd0a3d8b5ea368e25529bcb959d79946c969
Author: Antoni Mylka <[email protected]>
Date: 2015-10-29T14:17:29Z
Reverted two small changes automatically introduced by the Java style
settings of my IDE back to their original state.
----