Hi Ron,

I agree that would be helpful. I’ve added a GitHub issue [1].

As you’ve already indicated, you can post-process your database
instances. I think the easiest query for that is:

  delete nodes db:get('db')//*[empty(node())]

…followed by an optional db:optimize('db').
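If pre-processing outside BaseX is an option for you, another route is to drop the empty cells before the data ever reaches the database. A minimal sketch in Python (standard library only; all names here are illustrative, and it assumes your CSV headers are already valid XML element names):

```python
import csv
import io
from xml.etree.ElementTree import Element, SubElement, tostring

def csv_to_xml_skip_empty(text):
    """Convert CSV text to a <csv>/<record> tree, omitting empty cells.

    Mirrors the document shape BaseX's CSV parser produces, but
    skips cells that are empty or whitespace-only.
    """
    root = Element("csv")
    for row in csv.DictReader(io.StringIO(text)):
        record = SubElement(root, "record")
        for name, value in row.items():
            if value and value.strip():  # keep only non-empty cells
                SubElement(record, name).text = value
    return tostring(root, encoding="unicode")

sample = "card,source_code\nAA,\nBE,concept\n"
print(csv_to_xml_skip_empty(sample))
```

The resulting documents contain no empty elements at all, so queries like //source_code need no normalize-space() filter.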

Best,
Christian

[1] https://github.com/BaseXdb/basex/issues/2203



On Thu, Apr 20, 2023 at 1:06 PM Ron Van den Branden
<ron.vdbran...@gmail.com> wrote:
>
> Hi all,
>
> I'm investigating a way of analysing a massive set of more than 900,000 CSV 
> files, for which the CSV parsing in BaseX seems very useful, producing a db nicely 
> filled with documents such as:
>
> <csv>
>   <record>
>     <ResourceID>00003a92-d10e-585e-84a7-29ad17c5799f</ResourceID>
>     <source.id>bbcy:vev:6860</source.id>
>     <card>AA</card>
>     <order>0</order>
>     <source_field/>
>     <source_code/>
>     <Annotation>some remarks</Annotation>
>     <Annotation_Language>en</Annotation_Language>
>     <Annotation_Type/>
>     <resource_model/>
>     <!-- ... -->
>   </record>
>   <record>
>     <ResourceID>00003a92-d10e-585e-84a7-29ad17c5799f</ResourceID>
>     <source.id>bbcy:vev:6860</source.id>
>     <card>BE</card>
>     <order>0</order>
>     <source_field/>
>     <source_code>concept</source_code>
>     <Annotation/>
>     <Annotation_Language/>
>     <Annotation_Type/>
>     <resource_model/>
>     <!-- ... -->
>   </record>
>
>   <!-- ... -->
> </csv>
>
> Yet, when querying those documents, I'm noticing that even just selecting 
> non-empty elements is very slow. For example:
>
>   //source_code[normalize-space()]
>
> ...can take over 40 seconds.
>
> Since I don't have control over the source data, it would be really great if 
> empty cells could be skipped when parsing CSV files. Of course this would be 
> a trivial post-processing step via XSLT / XQuery, but that's infeasible for 
> that mass of data.
>
> Does BaseX provide a way of telling the CSV parser to skip empty cells?
>
> Best,
>
> Ron
