Re: [basex-talk] skipping empty cells when parsing CSV

2023-04-26 Thread Christian Grün
>
> Both [1] and [2] seem to work,
> though on [2], node() still seems to be the slightly faster alternative
> (given that elements either contain at least one non-whitespace
> character or are empty, as is the case with my data).


Exactly: If you write your own queries and if you know your data, node() is
the better choice. Thanks for testing!
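
To make the difference between the two predicates concrete, here is a minimal sketch on an in-memory sample. The element names (csv, record, source_code) are assumed for illustration, based on the query used in this thread; they are not taken from the real data.

```xquery
(: Hypothetical sample: one filled cell, one empty cell, and one cell
   that contains only whitespace. :)
declare variable $doc := document {
  <csv>
    <record><source_code>AA</source_code></record>
    <record><source_code/></record>
    <record><source_code>{ '   ' }</source_code></record>
  </csv>
};

(
  (: node() is true for any element with at least one child node,
     so the whitespace-only cell is counted as well: 2 :)
  count($doc//source_code[node()]),
  (: normalize-space() is true only if non-whitespace text is present: 1 :)
  count($doc//source_code[normalize-space()])
)
```

The two predicates agree only when a cell is either truly empty or has real content, which is why node() is safe for data like Ron's but not in general.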


Re: [basex-talk] skipping empty cells when parsing CSV

2023-04-26 Thread Ron Van den Branden

Hi Christian,

Fun it is, two birds with one stone! Both [1] and [2] seem to work,
though on [2], node() still seems to be the slightly faster alternative
(given that elements either contain at least one non-whitespace
character or are empty, as is the case with my data). So
thanks for making me aware of that performance gain anyway.


Many thanks!

Best,

Ron



Re: [basex-talk] skipping empty cells when parsing CSV

2023-04-26 Thread Christian Grün
Hi Ron,

The proposed option has been added to the latest snapshot [1,2].
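
For readers who want to try it, a minimal sketch of how such an option might be passed to csv:parse. The option name 'skip-empty' is an assumption here, since it is not spelled out in this thread; please check the CSV Module documentation [2] for the actual name and for the version that supports it. The sample CSV string is made up.

```xquery
(: Made-up input: a header row and two records with empty cells. :)
declare variable $csv :=
  "id,source_code,remarks&#10;1,AA,&#10;2,,some remarks";

(: ASSUMPTION: the new option is named 'skip-empty'; verify against [2]. :)
csv:parse($csv, map { 'header': true(), 'skip-empty': true() })
```

If the option behaves as requested in the GitHub issue, the empty cells should not show up as elements in the result.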

In addition, we’ve optimized the evaluation of fn:normalize-space. If
it’s applied on element nodes, it will internally be rewritten to a
more efficient representation: E[normalize-space()] →
E[descendant::text()[normalize-space()]].
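
A small sketch of the equivalence, on made-up elements: the string value of an element is the concatenation of its descendant text nodes, so both forms select the same items. The element names e and child are hypothetical.

```xquery
(: Hypothetical items: two with text (one of them nested), two without. :)
declare variable $items := (
  <e>text</e>,
  <e><child>nested text</child></e>,
  <e><child/></e>,
  <e/>
);

(: Original form: tests the whole string value of each element. :)
let $a := $items[normalize-space()]
(: Rewritten form: looks for a descendant text node with content. :)
let $b := $items[descendant::text()[normalize-space()]]
(: Both counts are 2. :)
return (count($a), count($b))
```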

Have fun,
Christian

[1] https://files.basex.org/releases/latest/
[2] https://docs.basex.org/wiki/CSV_Module#Options




Re: [basex-talk] skipping empty cells when parsing CSV

2023-04-20 Thread Ron Van den Branden

Hi Christian,

As always, many thanks for your lightning-speed help!

The update command appears to be way out of my physical memory league, 
but I'm subscribed to the GitHub issue.


Best,

Ron



Re: [basex-talk] skipping empty cells when parsing CSV

2023-04-20 Thread Christian Grün
Hi Ron,

I agree that would be helpful. I’ve added a GitHub issue [1].

As you’ve already indicated, you can post-process your database
instances. I think the easiest query for that is:

  delete nodes db:get('db')//*[empty(node())]

…followed by an optional db:optimize('db').
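
To illustrate what the delete expression removes, here is a sketch that applies the same predicate to an in-memory copy with standard XQuery Update (copy/modify/return), so that no database is touched. The record and its element names are hypothetical, and local:prune is a helper introduced only for this example.

```xquery
(: Hypothetical record with two empty cells (remarks, note). :)
declare variable $record :=
  <record>
    <id>3a92-d10e-585e-84a7-29ad17c5799f</id>
    <source_code>AA</source_code>
    <remarks/>
    <lang>en</lang>
    <note/>
  </record>;

(: Returns a copy of the element without descendants that have no children. :)
declare function local:prune($n as element()) as element() {
  copy $c := $n
  modify delete nodes $c//*[empty(node())]
  return $c
};

local:prune($record)
```

The returned copy keeps only id, source_code and lang. The original $record is left unchanged, whereas the database query above modifies the stored documents in place.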

Best,
Christian

[1] https://github.com/BaseXdb/basex/issues/2203



On Thu, Apr 20, 2023 at 1:06 PM Ron Van den Branden wrote:
>
> Hi all,
>
> I'm investigating a way of analysing a massive set of more than 900,000 CSV
> files, for which the CSV parsing in BaseX seems very useful, producing a db
> nicely filled with documents such as:
>
> 
>   
> 3a92-d10e-585e-84a7-29ad17c5799f
> bbcy:vev:6860
> AA
> 0
> 
> 
> some remarks
> en
> 
> 
> 
>   
>   
> 3a92-d10e-585e-84a7-29ad17c5799f
> bbcy:vev:6860
> BE
> 0
> 
> concept
> 
> 
> 
> 
> 
>   
>
>   
> 
>
> Yet, when querying those documents, I'm noticing that just selecting non-empty
> elements is very slow. For example:
>
>   //source_code[normalize-space()]
>
> ...can take over 40 seconds.
>
> Since I don't have control over the source data, it would be really great if 
> empty cells could be skipped when parsing CSV files. Of course this could be 
> a trivial post-processing step via XSLT / XQuery, but that's unfeasible for 
> that mass of data.
>
> Does BaseX provide a way of telling the CSV parser to skip empty cells?
>
> Best,
>
> Ron