[basex-talk] How to process very long node sequences

2015-03-18 Thread Hans-Juergen Rennau
Dear BaseX team,
can you help me with the following, very general problem?
I have a huge document in a database and want to iterate over many nodes which 
it contains, performing for each one an action which does not produce a value - 
for example, storing it in a SQL database.
I want to persuade the processor to visit, process and forget the nodes one 
after the other, rather than to attempt loading all nodes into memory before 
proceeding to process them. Schematically:
declare function f:processNode($node as node())  as empty-sequence() {...};

for $node in doc('huge-doc')/a/b/c
return f:processNode($node)
 ~ ~ ~
Well, in some cases it works, in others it doesn't. Is there any safe way to 
enforce sequential visit-process-forget processing?
I call this problem general because it is a critical aspect of dealing with 
huge documents. 

Cheers,
Hans-Jürgen




[basex-talk] Fwd: Re: How to process very long node sequences

2015-03-18 Thread Leonard Wörteler

Wrong mail account...

 Forwarded Message 
Subject: Re: [basex-talk] How to process very long node sequences
Date: Wed, 18 Mar 2015 19:26:00 +0100
From: Leo Wörteler l...@basex.org
To: Hans-Juergen Rennau hren...@yahoo.de
CC: basex-talk@mailman.uni-konstanz.de


Dear Hans-Jürgen,

On 18.03.2015 at 15:23, Hans-Juergen Rennau wrote:

> I want to persuade the processor to visit, process and forget the nodes
> one after the other, rather than to attempt loading all nodes into
> memory before proceeding to process them.


BaseX already uses iterative processing as its default mode. It only falls
back to caching if it has to, e.g. for sorting, reversing or duplicate
elimination. Calls to user-defined functions are also blocking at present,
but we are investigating whether that can be changed.


> Schematically:
>
> declare function f:processNode($node as node()) as empty-sequence() {...};
>
> for $node in doc('huge-doc')/a/b/c
> return f:processNode($node)


The example you gave will always be evaluated without caching the node
sequence. The `for` clause requests the result of its argument
iteratively, and the XPath expression you used only contains child
steps, which can be evaluated in document order without duplicates.
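
As a rough illustration of the distinction drawn here, the following sketch puts a child-step-only loop next to the sorting and reversing cases mentioned above. The small inline document is only a stand-in for the 'huge-doc' database, and whether a particular BaseX version actually caches any of these expressions is an assumption that should be checked against the query info rather than taken from this example.

```xquery
(: Inline stand-in for doc('huge-doc'); in practice this would be a database. :)
let $doc := document { <a><b><c>3</c><c>1</c><c>2</c></b></a> }
return (
  (: Child steps only: results arrive in document order without duplicates,
     so the loop can request them one at a time. :)
  for $c in $doc/a/b/c
  return string($c),

  (: Sorting has to see every item before the first result can be returned. :)
  for $c in $doc/a/b/c
  order by number($c)
  return string($c),

  (: Reversing likewise needs the end of the sequence first. :)
  reverse($doc/a/b/c) ! string(.)
)
```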


> Well, in some cases it works, in others it doesn't. Is there any safe
> way to enforce sequential visit-process-forget processing?


When the XPath expression becomes more complex, it is not as easy to
predict whether it uses caching internally. BaseX tries quite hard to
detect paths that do not need it; the algorithm can be seen in [1]. If you
see a `CachedPath` in the Info View of the GUI, you can try to reformulate
the query.


> I call this problem general because it is a critical aspect of dealing
> with huge documents.


XQuery does not (yet) have a dedicated *streaming mode* like that of
XSLT 3.0 [2], but it would definitely be possible to check whether some
part of a query (e.g. marked by a pragma) is evaluated without caching.
It would, however, be quite some work.
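
To make the pragma idea concrete, here is a minimal sketch of what marking part of a query could look like syntactically. The namespace URI and the pragma name `ex:assert-uncached` are invented for illustration; no such check exists according to this thread, and a processor that does not recognize the pragma should simply evaluate the enclosed expression.

```xquery
(: Hypothetical namespace and pragma name, made up for this illustration. :)
declare namespace ex = 'urn:example:streaming';

(: Inline stand-in for a large database document. :)
let $doc := document { <a><b><c>1</c><c>2</c></b></a> }
return (# ex:assert-uncached #) {
  (: The region that a future check could verify as cache-free. :)
  for $c in $doc/a/b/c
  return string($c)
}
```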

Hope that helps,
  Leo

[1]
https://github.com/BaseXdb/basex/blob/0b828a8/basex-core/src/main/java/org/basex/query/expr/path/Path.java#L297-359
[2] http://www.w3.org/TR/xslt-30/#dt-guaranteed-streamable


Re: [basex-talk] Fwd: Re: How to process very long node sequences

2015-03-18 Thread Hans-Juergen Rennau
Thank you very much, Leo, and thanks for your interest, Marc.
An observation I made, which turned out to be crucial to my concrete 
problem:

not caching:
    for $node at $pos in doc('otds-fti')/foo/bar
    return ...

caching:
    for $node at $pos in doc('otds-fti')/foo/bar/document{.}
    return ...

So the final step /document{.} turns the for loop into caching. 
Interestingly, the problem can be worked around very simply:

not caching:
    for $node at $pos in doc('otds-fti')/foo/bar
    let $node := document{$node}
    return ...

I am very glad that the problem could be solved. Your pointing out the 
importance of the expression details

> and the XPath expression you used only contains child steps, which can be
> evaluated in document order without duplicates.

helped me to make the revealing experiment.
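
For readers who want to try the shape of this workaround, here is a small self-contained sketch. The inline document and the variable name `$copy` are assumptions standing in for the 'otds-fti' database and the rebinding of `$node`; the sketch shows only the structure of the query and does not by itself demonstrate the caching behaviour, which has to be checked in the query info.

```xquery
(: Inline stand-in for doc('otds-fti'). :)
let $doc := document { <foo><bar>one</bar><bar>two</bar></foo> }
for $node at $pos in $doc/foo/bar
(: Copy the current node into its own document inside the loop body,
   instead of appending /document{.} as a final path step. :)
let $copy := document { $node }
return $pos || ': ' || name($copy/*) || ' = ' || string($copy)
```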

Thanks again,
Hans-Jürgen

PS: Interestingly, the following variant is also not caching:

    declare function f:evaluate($expr, $context) {
      let $ctx := map{'': $context}
      return xquery:eval($expr, $ctx)
    };

    let $context := doc('otds-fti')
    for $node at $pos in f:evaluate('/foo/bar', $context)
    let $node := document{$node}
    return ...

 





  

Re: [basex-talk] Fwd: Re: How to process very long node sequences

2015-03-18 Thread Leonard Wörteler

Hi Marc,

On 18.03.2015 at 21:08, Marc wrote:

> As I understand the first part of the answer, you say that BaseX caches
> when there is a user-defined function, but then, when you analyze the
> example, you say that in this case there is no caching even though there
> is an f:processNode function in the return.
>
> Did I misunderstand?


Well, kind of. When evaluating a call to a user-defined function (that 
was not inlined during compilation), both the arguments and the result 
of the call are fully evaluated and cached in memory. This is still the 
case in Hans-Jürgen's example.
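
Since the caching described here applies to calls that were not inlined, one thing to experiment with is asking for inlining explicitly. The sketch below assumes the `%basex:inline` annotation is available in the BaseX version at hand; the `f` namespace URI, the inline document and the function body are invented for illustration. Whether the call is really inlined should be verified in the optimized query shown in the query info.

```xquery
(: Hypothetical namespace for the example function. :)
declare namespace f = 'urn:example:f';

(: %basex:inline requests inlining of calls to this function;
   treat it as a request to verify, not a guarantee. :)
declare %basex:inline function f:processNode($node as node()) as xs:integer {
  string-length($node)
};

(: Inline stand-in for doc('huge-doc'). :)
let $doc := document { <a><b><c>one</c><c>three</c></b></a> }
for $node in $doc/a/b/c
return f:processNode($node)
```

An alternative with the same intent is to write the per-node expression directly in the `return` clause, so that no function call is involved at all.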


As I understand his question, however, he is mostly concerned about the 
(potentially very big) sequence of nodes that the `for` loop iterates 
over. I just pointed out that this sequence will not normally be cached 
by BaseX.


Hope that clears it up,
  Cheers, Leo


Re: [basex-talk] Fwd: Re: How to process very long node sequences

2015-03-18 Thread Marc

Hi Hans-Jürgen,
As I understand the first part of the answer, you say that BaseX caches 
when there is a user-defined function, but then, when you analyze the 
example, you say that in this case there is no caching even though there 
is an f:processNode function in the return.


Did I misunderstand?

Marc
