[basex-talk] How to process very long node sequences
Dear BaseX team,

can you help me with the following, very general problem?

I have a huge document in a database and want to iterate over many of the nodes it contains, performing for each one an action that does not produce a value, for example storing it in an SQL database. I want to persuade the processor to visit, process and forget the nodes one after the other, rather than attempting to load all nodes into memory before starting to process them. Schematically:

    declare function f:processNode($node as node()) as empty-sequence() {...};
    for $node in doc('huge-doc')/a/b/c
    return f:processNode($node)

Well, in some cases it works, in others it doesn't. Is there any safe way to enforce sequential visit-process-forget processing?

I call this problem general because it is a critical aspect of dealing with huge documents.

Cheers,
Hans-Jürgen
[basex-talk] Fwd: Re: How to process very long node sequences
Wrong mail account...

Forwarded message
Subject: Re: [basex-talk] How to process very long node sequences
Date: Wed, 18 Mar 2015 19:26:00 +0100
From: Leo Wörteler l...@basex.org
To: Hans-Juergen Rennau hren...@yahoo.de
CC: basex-talk@mailman.uni-konstanz.de

Dear Hans-Jürgen,

On 18.03.2015 at 15:23, Hans-Juergen Rennau wrote:
> I want to persuade the processor to visit, process and forget the nodes one after the other, rather than attempting to load all nodes into memory before starting to process them.

BaseX already uses iterative processing as its default mode. It only falls back to caching if it has to, e.g. for sorting, reversing or duplicate elimination. Calls to user-defined functions are currently also blocking, but we are investigating whether that can be changed.

> Schematically:
>
>     declare function f:processNode($node as node()) as empty-sequence() {...};
>     for $node in doc('huge-doc')/a/b/c
>     return f:processNode($node)

The example you gave will always be evaluated without caching the node sequence. The `for` clause requests the result of its argument iteratively, and the XPath expression you used only contains child steps, which can be evaluated in document order without duplicates.

> Well, in some cases it works, in others it doesn't. Is there any safe way to enforce sequential visit-process-forget processing?

When the XPath expression becomes more complex, it is not as easy to predict whether it uses caching internally. BaseX tries quite hard to detect paths that do not need it; the algorithm can be seen in [1]. If you see a `CachedPath` in the Info View of the GUI, you can try to reformulate the query.

> I call this problem general because it is a critical aspect of dealing with huge documents.

XQuery does not have a dedicated *streaming mode* like that of XSLT 3.0 [2] (yet), but it would definitely be possible to check whether some part of a query (e.g. marked by a pragma) is evaluated without caching. It would, however, be quite some work.

Hope that helps,
Leo

[1] https://github.com/BaseXdb/basex/blob/0b828a8/basex-core/src/main/java/org/basex/query/expr/path/Path.java#L297-359
[2] http://www.w3.org/TR/xslt-30/#dt-guaranteed-streamable
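A minimal, self-contained sketch of the two query shapes distinguished in the reply above. The in-memory `$doc` is a hypothetical stand-in for doc('huge-doc'), and `reverse()` is picked only as one example of the operations named above (sorting, reversing, duplicate elimination) that need the complete sequence before the first item can be delivered. With such a tiny constant input the optimizer may well pre-evaluate everything, so this only illustrates the shapes; whether a real path is cached still has to be checked in the Info View of the GUI.

```xquery
(: Hypothetical stand-in for the huge document; the element names
   a/b/c follow the schematic query in the thread. :)
declare variable $doc := document {
  <a>{
    for $i in 1 to 3
    return <b><c>{ $i }</c></b>
  }</a>
};

(
  (: Child steps only: the nodes can be delivered one by one,
     in document order and without duplicates. :)
  for $node in $doc/a/b/c
  return string($node),

  (: reverse() has to see the last item first, so the sequence
     must be complete before the loop can start. :)
  for $node in reverse($doc/a/b/c)
  return string($node)
)
```

The query returns the strings 1, 2, 3 followed by 3, 2, 1.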
Re: [basex-talk] Fwd: Re: How to process very long node sequences
Thank you very much, Leo, and thanks for your interest, Marc.

An observation I made, which turned out to be crucial to my concrete problem:

not caching:

    for $node at $pos in doc('otds-fti')/foo/bar
    return ...

caching:

    for $node at $pos in doc('otds-fti')/foo/bar/document{.}
    return ...

So the final step /document{.} turns the for loop into a caching one. Interestingly, the problem can be worked around very simply:

not caching:

    for $node at $pos in doc('otds-fti')/foo/bar
    let $node := document{$node}
    return ...

I am very glad that the problem could be solved. Your pointing out the importance of the expression details, in particular

> the XPath expression you used only contains child steps, which can be evaluated in document order without duplicates.

helped me to make the revealing experiment.

Thanks again
-Hans-Jürgen

PS: Interestingly, the following variant is also not caching:

    declare function f:evaluate($expr, $context) {
      let $ctx := map{'': $context}
      return xquery:eval($expr, $ctx)
    };
    let $context := doc('otds-fti')
    for $node at $pos in f:evaluate('/foo/bar', $context)
    let $node := document{$node}
    return ...

Leonard Wörteler leonard.woerte...@uni-konstanz.de wrote at 23:12 on Wednesday, 18 March 2015:
> Hi Marc,
>
> On 18.03.2015 at 21:08, Marc wrote:
>> As I understand the first part of the answer, you say that BaseX does cache when there is a user-defined function. But then, when you analyze the example, you say that in this case there is no caching, even though there is an f:processNode call in the return clause. Or did I misunderstand?
>
> well, kind of. When evaluating a call to a user-defined function (that was not inlined during compilation), both the arguments and the result of the call are fully evaluated and cached in memory. This is still the case in Hans-Jürgen's example. As I understand his question, however, he is mostly concerned about the (potentially very big) sequence of nodes that the `for` loop iterates over. I just pointed out that that sequence will not normally be cached by BaseX.
>
> Hope that clears it up, cheers,
> Leo
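The workaround reported by Hans-Jürgen can be sketched as a self-contained query. The in-memory `$doc` and the `n` attribute are hypothetical stand-ins for doc('otds-fti'), and `$wrapped` is an illustrative variable name. The sketch only shows where the document constructor moves (out of the path, into a `let` clause); the caching behaviour itself is the poster's observation and cannot be seen in the result.

```xquery
(: Hypothetical stand-in for doc('otds-fti'). :)
declare variable $doc := document {
  <foo>{
    for $i in 1 to 3
    return <bar n="{ $i }"/>
  }</foo>
};

(: The path ends with a plain child step ... :)
for $bar at $pos in $doc/foo/bar
(: ... and each element is wrapped in its own document node inside
   the loop, instead of appending /document{.} as a final path step. :)
let $wrapped := document { $bar }
return $pos || ': ' || name($wrapped/*)
```

Each iteration works on a fresh single-element document, so code that expects a document node as input can still be used per item.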
Re: [basex-talk] Fwd: Re: How to process very long node sequences
Hi Marc,

On 18.03.2015 at 21:08, Marc wrote:
> As I understand the first part of the answer, you say that BaseX does cache when there is a user-defined function. But then, when you analyze the example, you say that in this case there is no caching, even though there is an f:processNode call in the return clause. Or did I misunderstand?

well, kind of. When evaluating a call to a user-defined function (that was not inlined during compilation), both the arguments and the result of the call are fully evaluated and cached in memory. This is still the case in Hans-Jürgen's example. As I understand his question, however, he is mostly concerned about the (potentially very big) sequence of nodes that the `for` loop iterates over. I just pointed out that that sequence will not normally be cached by BaseX.

Hope that clears it up, cheers,
Leo
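A small sketch of the distinction made in this reply, assuming the behaviour described there (arguments and results of non-inlined user-defined function calls are fully evaluated). The namespace URI, the in-memory `$doc` and both function names are hypothetical. In the first call pattern the fully evaluated argument is a single node per call; in the second it is the whole node sequence. Functions this small may well be inlined by the compiler, so the sketch shows the call shapes only.

```xquery
declare namespace f = 'urn:example:f';  (: hypothetical namespace :)

(: Hypothetical stand-in for doc('huge-doc'). :)
declare variable $doc := document {
  <a>{
    for $i in 1 to 3
    return <b><c>{ $i }</c></b>
  }</a>
};

(: Called once per node: the argument of each call is one node. :)
declare function f:process-node($node as node()) as xs:string {
  'processed ' || $node
};

(: Called once for everything: the argument is the whole sequence. :)
declare function f:process-all($nodes as node()*) as xs:string* {
  for $node in $nodes
  return 'processed ' || $node
};

(
  (: Shape used in the thread: the loop drives the iteration. :)
  for $node in $doc/a/b/c
  return f:process-node($node),

  (: Alternative shape: the function receives all nodes at once. :)
  f:process-all($doc/a/b/c)
)
```

Both shapes return the same three strings; they differ only in what is handed to the function per call.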
Re: [basex-talk] Fwd: Re: How to process very long node sequences
Hi Hans-Jürgen,

As I understand the first part of the answer, you say that BaseX does cache when there is a user-defined function. But then, when you analyze the example, you say that in this case there is no caching, even though there is an f:processNode call in the return clause. Or did I misunderstand?

Marc

On 18/03/2015 at 19:38, Leonard Wörteler wrote:
> Calls to user-defined functions are currently also blocking, but we are currently investigating if that can be changed.
> [...]
> The example you gave will always be evaluated without caching the node sequence.