[
https://issues.apache.org/jira/browse/JENA-44?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13092106#comment-13092106
]
Stephen Allen commented on JENA-44:
-----------------------------------
I did not include a cancellation mechanism in the DataBags themselves because
it was not clear to me that it would be necessary.
The only point at which a significant amount of time can be spent in the
DataBag code is in the add() method right as a spill is occurring. The program
execution may be in Arrays.sort() (SortedDataBag and DistinctDataBag) or it may
be in the process of serializing tuples to disk. Given anticipated spill
thresholds (1,000-100,000 tuples or memory in the 10-100 MB range), and the
fact that disk I/O is sequential (and thus fast), it seemed like an unnecessary
complication to support cancellation, since those operations would complete
within tens of seconds. Any physical query operator using the DataBag would
then be able to cancel immediately after the spill finished (QueryIterSort
passes the cancel request to its embedded iterator, which will then throw the
QueryCancellationException on the next iteration).
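
To illustrate the timing described above, here is a minimal sketch, not the actual ARQ code. SpillingSortedBag and CancellableLoader are hypothetical names, strings stand in for tuples, and java.util.concurrent.CancellationException stands in for the query cancellation exception. The point is only that add() runs its sort and write to completion, and the cancel flag is checked between calls to add().

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.CancellationException;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for a sorted, spilling bag (not the ARQ DataBag API). */
class SpillingSortedBag implements Closeable {
    private final int threshold;                          // spill once this many items are in memory
    private final List<String> memory = new ArrayList<>();
    private final List<Path> spillFiles = new ArrayList<>();

    SpillingSortedBag(int threshold) { this.threshold = threshold; }

    /** No cancellation check in here: a spill, once started, always finishes. */
    void add(String item) {
        memory.add(item);
        if (memory.size() >= threshold) spill();
    }

    private void spill() {
        try {
            Collections.sort(memory);                     // sort the in-memory run
            Path file = Files.createTempFile("bag", ".spill");
            Files.write(file, memory);                    // sequential write, one item per line
            spillFiles.add(file);
            memory.clear();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    int spillCount() { return spillFiles.size(); }

    List<Path> spillFiles() { return Collections.unmodifiableList(spillFiles); }

    /** Deletes every spill file and drops the in-memory items. */
    @Override public void close() {
        try {
            for (Path file : spillFiles) Files.deleteIfExists(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        spillFiles.clear();
        memory.clear();
    }
}

/** Hypothetical operator that fills the bag and honours cancel requests between adds. */
class CancellableLoader {
    private final AtomicBoolean cancelled = new AtomicBoolean(false);

    void cancel() { cancelled.set(true); }

    /** Returns the number of items added; throws if a cancel request is seen first. */
    int load(Iterator<String> input, SpillingSortedBag bag) {
        int added = 0;
        while (input.hasNext()) {
            // Checked once per item, so a cancel that arrives during a spill
            // takes effect on the iteration right after that spill.
            if (cancelled.get()) {
                throw new CancellationException("cancelled after " + added + " items");
            }
            bag.add(input.next());
            added++;
        }
        return added;
    }
}
```

With this shape, the worst-case delay between a cancel request and the exception is the duration of one spill, which is the trade-off argued for above.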
After the add phase is complete, and the QueryIterSort starts returning
results, cancellation will be handled by the super class (QueryIteratorBase).
Porting the tests meant that they would test the QueryIterSort with the
embedded DataBag to be sure that the temporary files were cleaned up when the
iterator was cancelled. So it's not really testing cancellation on the DataBag
per se, but rather the new QueryIterSort.
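
As a rough illustration of what such a test checks, here is a sketch under stated assumptions. SortIteratorSketch is a hypothetical class, not the ARQ QueryIterSort; it always spills to one temporary file to keep the example short, and it uses java.util.concurrent.CancellationException as a stand-in exception. The behaviour of interest is that cancel() removes the temporary file and that later calls to next() fail.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.CancellationException;

/** Hypothetical sort iterator that owns one spill file (not the ARQ QueryIterSort). */
class SortIteratorSketch implements Iterator<String> {
    private final Path spillFile;        // temporary file owned by this iterator
    private Iterator<String> sorted;     // results read back from the spill file
    private boolean cancelled = false;

    SortIteratorSketch(List<String> input) {
        try {
            List<String> copy = new ArrayList<>(input);
            Collections.sort(copy);
            spillFile = Files.createTempFile("sort", ".spill");
            Files.write(spillFile, copy);                      // one item per line
            sorted = Files.readAllLines(spillFile).iterator();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    Path spillFile() { return spillFile; }

    /** Cancel: release the temporary file at once; later next() calls throw. */
    void cancel() {
        cancelled = true;
        close();
    }

    /** Drops any remaining results and deletes the temporary file. */
    void close() {
        sorted = Collections.emptyIterator();
        try {
            Files.deleteIfExists(spillFile);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override public boolean hasNext() { return !cancelled && sorted.hasNext(); }

    @Override public String next() {
        if (cancelled) throw new CancellationException("query cancelled");
        return sorted.next();
    }
}
```

A test in this style reads one result, cancels, and then asserts both that the file returned by spillFile() no longer exists and that next() throws.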
> Support external sorting of bindings in ARQ
> -------------------------------------------
>
> Key: JENA-44
> URL: https://issues.apache.org/jira/browse/JENA-44
> Project: Jena
> Issue Type: New Feature
> Components: ARQ
> Reporter: Sam Tunnicliffe
> Assignee: Paolo Castagna
> Priority: Minor
> Attachments: JENA-44-0.patch,
> JENA-44-Depends-on-JENA-99-r1157891.patch, JENA-44_ARQ_r1156212.patch,
> JENA-44_ARQ_r8531.patch, JENA-44_ARQ_r8724.patch
>
>
> In QueryIterSort, the sorting of the contents of an Iterator<Binding> is done
> in memory, using Arrays.sort. This can be problematic where the set to be
> sorted is large. A possible solution could be to use an external, disk-backed
> algorithm. A hybrid approach may be better, whereby we attempt the in-memory
> sort, but when the number of bindings encountered goes over a certain number,
> resort to the disk-backed variant.