Hi Christian,

Fun it is, two birds with one stone! Both [1] and [2] seem to work, though with [2], node() still seems to be the slightly faster alternative (given that, in my data, elements either contain at least one non-whitespace character or are empty). So thanks for making me aware of that performance gain anyway.
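For the record, the two predicates compared (a minimal sketch, assuming, as described above, that elements are either empty or contain at least one non-whitespace character):

```xquery
(: selects cells with non-whitespace content; robust against whitespace-only cells :)
//source_code[normalize-space()],

(: merely tests for any child node; slightly faster, but only equivalent
   if cells are never whitespace-only, as is the case with my data :)
//source_code[node()]
```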

Many thanks!

Best,

Ron

On 26/04/2023 16:53, Christian Grün wrote:
Hi Ron,

The proposed option has been added to the latest snapshot [1,2].

In addition, we’ve optimized the evaluation of fn:normalize-space. If
it’s applied on element nodes, it will internally be rewritten to a
more efficient representation: E[normalize-space()] →
E[descendant::text()[normalize-space()]].
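In other words (illustrative, using a source_code element as in the sample data further down), a query such as the first one below is now evaluated as the second one, without any change required on the user's side:

```xquery
(: as written by the user :)
//source_code[normalize-space()]

(: internal, more efficient representation :)
//source_code[descendant::text()[normalize-space()]]
```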

Have fun,
Christian

[1] https://files.basex.org/releases/latest/
[2] https://docs.basex.org/wiki/CSV_Module#Options


On Thu, Apr 20, 2023 at 3:58 PM Ron Van den Branden
<ron.vdbran...@gmail.com> wrote:
Hi Christian,

As always, many thanks for your lightning-speed help!

The update command appears to be way out of my physical memory league,
but I'm subscribed to the GitHub issue.

Best,

Ron

On 20/04/2023 14:28, Christian Grün wrote:
Hi Ron,

I agree that would be helpful. I’ve added a GitHub issue [1].

As you’ve already indicated, you can post-process your databases
instances. I think the easiest query for that is:

    delete nodes db:get('db')//*[empty(node())]

…followed by an optional db:optimize('db').

Best,
Christian

[1] https://github.com/BaseXdb/basex/issues/2203



On Thu, Apr 20, 2023 at 1:06 PM Ron Van den Branden
<ron.vdbran...@gmail.com> wrote:
Hi all,

I'm investigating a way of analysing a massive set of > 900.000 CSV files, for 
which the CSV parsing in BaseX seems very useful, producing a db nicely filled 
with documents such as:

<csv>
    <record>
      <ResourceID>00003a92-d10e-585e-84a7-29ad17c5799f</ResourceID>
      <source.id>bbcy:vev:6860</source.id>
      <card>AA</card>
      <order>0</order>
      <source_field/>
      <source_code/>
      <Annotation>some remarks</Annotation>
      <Annotation_Language>en</Annotation_Language>
      <Annotation_Type/>
      <resource_model/>
      <!-- ... -->
    </record>
    <record>
      <ResourceID>00003a92-d10e-585e-84a7-29ad17c5799f</ResourceID>
      <source.id>bbcy:vev:6860</source.id>
      <card>BE</card>
      <order>0</order>
      <source_field/>
      <source_code>concept</source_code>
      <Annotation/>
      <Annotation_Language/>
      <Annotation_Type/>
      <resource_model/>
      <!-- ... -->
    </record>

    <!-- ... -->
</csv>

Yet, when querying those documents, I'm noticing that even just selecting non-empty 
elements is very slow. For example:

    //source_code[normalize-space()]

...can take over 40 seconds.

Since I don't have control over the source data, it would be really great if 
empty cells could be skipped when parsing CSV files. Of course this could be a 
trivial post-processing step via XSLT / XQuery, but that's infeasible for such a 
mass of data.

Does BaseX provide a way of telling the CSV parser to skip empty cells?

Best,

Ron
