Hi Christian,
Fun it is, two birds with one stone! Both [1] and [2] seem to work,
though on [2], node() still seems to be the slightly faster alternative
(given that elements either contain at least one non-whitespace
character or are empty, as is the case with my data). So
thanks for making me aware of that performance gain anyway.
Many thanks!
Best,
Ron
On 26/04/2023 16:53, Christian Grün wrote:
Hi Ron,
The proposed option has been added to the latest snapshot [1,2].
In addition, we’ve optimized the evaluation of fn:normalize-space. If
it’s applied on element nodes, it will internally be rewritten to a
more efficient representation: E[normalize-space()] →
E[descendant::text()[normalize-space()]].
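As a rough illustration of the rewriting described above, the following sketch compares the three predicates discussed in this thread on a small inline sample. The sample is made up for illustration, modelled on the records quoted further down. It only shows that the predicates select the same elements when every element is either empty or contains non-whitespace text. It says nothing about timings.

```xquery
(: Hypothetical sample, modelled on the records quoted in this thread. :)
let $csv :=
  <csv>
    <record>
      <card>AA</card>
      <source_code/>
    </record>
    <record>
      <card>BE</card>
      <source_code>concept</source_code>
    </record>
  </csv>
return (
  (: original predicate: string value contains non-whitespace characters :)
  count($csv//source_code[normalize-space()]),
  (: rewritten form: a descendant text node contains non-whitespace characters :)
  count($csv//source_code[descendant::text()[normalize-space()]]),
  (: cheaper check: the element has any child node at all :)
  count($csv//source_code[node()])
)
```

All three counts are 1 for this sample. The node() variant would differ from the other two only if an element contained nothing but whitespace, a comment, or a processing instruction.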
Have fun,
Christian
[1] https://files.basex.org/releases/latest/
[2] https://docs.basex.org/wiki/CSV_Module#Options
On Thu, Apr 20, 2023 at 3:58 PM Ron Van den Branden
<ron.vdbran...@gmail.com> wrote:
Hi Christian,
As always, many thanks for your lightning-speed help!
The update command appears to be way out of my physical memory league,
but I'm subscribed to the GitHub issue.
Best,
Ron
On 20/04/2023 14:28, Christian Grün wrote:
Hi Ron,
I agree that would be helpful. I’ve added a GitHub issue [1].
As you’ve already indicated, you can post-process your database
instances. I think the easiest query for that is:
delete nodes db:get('db')//*[empty(node())]
…followed by an optional db:optimize('db').
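Before running the delete on a large database, the same predicate can be tried on an in-memory copy. The following is a minimal sketch using the standard copy/modify/return expression. The inline record is a made-up sample, not real data, and the database itself is not touched.

```xquery
(: Hypothetical sample record; the update is applied to a copy only. :)
copy $c :=
  <record>
    <card>AA</card>
    <source_field/>
    <source_code/>
    <Annotation>some remarks</Annotation>
  </record>
modify delete nodes $c//*[empty(node())]
return $c
```

The returned record keeps only the card and Annotation elements. Note that the predicate is evaluated before any node is removed, so an element whose only children are empty elements is kept, without its children.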
Best,
Christian
[1] https://github.com/BaseXdb/basex/issues/2203
On Thu, Apr 20, 2023 at 1:06 PM Ron Van den Branden
<ron.vdbran...@gmail.com> wrote:
Hi all,
I'm investigating a way of analysing a massive set of > 900,000 CSV files, for
which the CSV parsing in BaseX seems very useful, producing a db nicely filled
with documents such as:
<csv>
<record>
<ResourceID>00003a92-d10e-585e-84a7-29ad17c5799f</ResourceID>
<source.id>bbcy:vev:6860</source.id>
<card>AA</card>
<order>0</order>
<source_field/>
<source_code/>
<Annotation>some remarks</Annotation>
<Annotation_Language>en</Annotation_Language>
<Annotation_Type/>
<resource_model/>
<!-- ... -->
</record>
<record>
<ResourceID>00003a92-d10e-585e-84a7-29ad17c5799f</ResourceID>
<source.id>bbcy:vev:6860</source.id>
<card>BE</card>
<order>0</order>
<source_field/>
<source_code>concept</source_code>
<Annotation/>
<Annotation_Language/>
<Annotation_Type/>
<resource_model/>
<!-- ... -->
</record>
<!-- ... -->
</csv>
Yet, when querying those documents, I'm noticing how just selecting non-empty
elements is very slow. For example:
//source_code[normalize-space()]
...can take over 40 seconds.
Since I don't have control over the source data, it would be really great if
empty cells could be skipped when parsing CSV files. Of course this could be a
trivial post-processing step via XSLT / XQuery, but that's unfeasible for that
mass of data.
Does BaseX provide a way of telling the CSV parser to skip empty cells?
Best,
Ron