Re: [basex-talk] skipping empty cells when parsing CSV
> Both [1] and [2] seem to work, though on [2], node() still seems to be
> the slightly faster alternative (when knowing that elements can only
> contain at least one non-whitespace character, or are empty otherwise,
> as is the case with my data).

Exactly: if you write your own queries and you know your data, node() is the better choice. Thanks for testing!
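[Editorial illustration] The difference between the two emptiness tests discussed above can be sketched outside BaseX with Python's ElementTree. This is only an analogy, not the BaseX internals: `node()` roughly corresponds to "the element has any child node or text at all", while `normalize-space()` corresponds to "the element has some non-whitespace character content". The helper names and sample data are invented for illustration.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<row>"
    "<source_code>AA</source_code>"   # non-empty cell
    "<source_code>  </source_code>"   # whitespace-only cell
    "<source_code/>"                  # truly empty cell
    "</row>"
)

def has_node(e):
    # analogue of E[node()]: any child element, or any text content at all
    return len(e) > 0 or e.text is not None

def has_normalized_text(e):
    # analogue of E[normalize-space()]: some non-whitespace character content
    return any(t.strip() for t in e.itertext())

cells = doc.findall("source_code")
print([has_node(c) for c in cells])             # [True, True, False]
print([has_normalized_text(c) for c in cells])  # [True, False, False]
```

The whitespace-only cell is where the two tests disagree, which is why `node()` is only safe when you know your data cannot contain whitespace-only elements.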
Re: [basex-talk] skipping empty cells when parsing CSV
Hi Christian,

Fun it is, two birds with one stone! Both [1] and [2] seem to work, though on [2], node() still seems to be the slightly faster alternative (when knowing that elements can only contain at least one non-whitespace character, or are empty otherwise, as is the case with my data). So thanks for making me aware of that performance gain anyway.

Many thanks!

Best,
Ron

On 26/04/2023 16:53, Christian Grün wrote:
> Hi Ron,
>
> The proposed option has been added to the latest snapshot [1,2].
>
> In addition, we’ve optimized the evaluation of fn:normalize-space. If it’s
> applied on element nodes, it will internally be rewritten to a more
> efficient representation:
> E[normalize-space()] → E[descendant::text()[normalize-space()]].
>
> Have fun,
> Christian
>
> [1] https://files.basex.org/releases/latest/
> [2] https://docs.basex.org/wiki/CSV_Module#Options
Re: [basex-talk] skipping empty cells when parsing CSV
Hi Ron,

The proposed option has been added to the latest snapshot [1,2].

In addition, we’ve optimized the evaluation of fn:normalize-space. If it’s applied on element nodes, it will internally be rewritten to a more efficient representation:

  E[normalize-space()] → E[descendant::text()[normalize-space()]]

Have fun,
Christian

[1] https://files.basex.org/releases/latest/
[2] https://docs.basex.org/wiki/CSV_Module#Options

On Thu, Apr 20, 2023 at 3:58 PM Ron Van den Branden wrote:
>
> Hi Christian,
>
> As always, many thanks for your lightning-speed help!
>
> The update command appears to be way out of my physical memory league,
> but I'm subscribed to the GitHub issue.
>
> Best,
> Ron
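[Editorial illustration] For readers without the snapshot, the effect of the new option can be approximated outside BaseX. The following is a rough Python sketch of a CSV-to-XML conversion that simply emits no element for an empty cell; the function name, the `<csv>`/`<record>` wrapper, and the sample column names are invented for illustration and do not reproduce the BaseX CSV module's exact output format.

```python
import csv
import io
import xml.etree.ElementTree as ET

def csv_to_xml(text, skip_empty=True):
    """Convert CSV text (with a header row) to a <csv><record>...</record></csv>
    tree, optionally omitting elements for empty cells."""
    reader = csv.DictReader(io.StringIO(text))
    root = ET.Element("csv")
    for row in reader:
        record = ET.SubElement(root, "record")
        for name, value in row.items():
            if skip_empty and not value:
                continue  # empty cell: create no element at all
            ET.SubElement(record, name).text = value
    return root

sample = "id,source_code,remark\nr1,AA,some remarks\nr2,,concept\n"
root = csv_to_xml(sample)
print(ET.tostring(root, encoding="unicode"))
# the second record carries no <source_code> element at all
```

Skipping the element at parse time, rather than deleting it afterwards, is what makes this workable for a very large number of input files: the empty nodes are never materialized in the database in the first place.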
Re: [basex-talk] skipping empty cells when parsing CSV
Hi Christian,

As always, many thanks for your lightning-speed help!

The update command appears to be way out of my physical memory league, but I'm subscribed to the GitHub issue.

Best,
Ron

On 20/04/2023 14:28, Christian Grün wrote:
> Hi Ron,
>
> I agree that would be helpful. I’ve added a GitHub issue [1].
>
> As you’ve already indicated, you can post-process your database
> instances. I think the easiest query for that is:
>
>   delete nodes db:get('db')//*[empty(node())]
>
> …followed by an optional db:optimize('db').
>
> Best,
> Christian
>
> [1] https://github.com/BaseXdb/basex/issues/2203
Re: [basex-talk] skipping empty cells when parsing CSV
Hi Ron,

I agree that would be helpful. I’ve added a GitHub issue [1].

As you’ve already indicated, you can post-process your database instances. I think the easiest query for that is:

  delete nodes db:get('db')//*[empty(node())]

…followed by an optional db:optimize('db').

Best,
Christian

[1] https://github.com/BaseXdb/basex/issues/2203

On Thu, Apr 20, 2023 at 1:06 PM Ron Van den Branden wrote:
>
> Hi all,
>
> I'm investigating a way of analysing a massive set of > 900,000 CSV files,
> for which the CSV parsing in BaseX seems very useful, producing a db nicely
> filled with documents such as:
>
>   3a92-d10e-585e-84a7-29ad17c5799f
>   bbcy:vev:6860
>   AA
>   0
>   some remarks
>   en
>
>   3a92-d10e-585e-84a7-29ad17c5799f
>   bbcy:vev:6860
>   BE
>   0
>   concept
>
> Yet, when querying those documents, I'm noticing that just selecting
> non-empty elements is very slow. For example:
>
>   //source_code[normalize-space()]
>
> ...can take over 40 seconds.
>
> Since I don't have control over the source data, it would be really great
> if empty cells could be skipped when parsing CSV files. Of course this
> could be a trivial post-processing step via XSLT / XQuery, but that's
> unfeasible for that mass of data.
>
> Does BaseX provide a way of telling the CSV parser to skip empty cells?
>
> Best,
> Ron
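[Editorial illustration] The clean-up that `delete nodes db:get('db')//*[empty(node())]` performs inside a BaseX database can be sketched for a standalone XML tree with Python's ElementTree. This illustrates the idea only, not the BaseX command; the function name and sample document are invented. Like the XQuery Update expression, which selects its targets against a single snapshot, this is a single pass: an element that only becomes empty because its children were deleted is not removed.

```python
import xml.etree.ElementTree as ET

def delete_empty_elements(root):
    """Remove every element below root that has no child elements and no
    text node, mirroring //*[empty(node())]. Single pass, snapshot-style:
    parents emptied by the pass itself are kept, as in XQuery Update."""
    for parent in list(root.iter()):
        for child in list(parent):
            # text is None means "no text node at all"; a whitespace-only
            # text node still counts as a node and is kept, as in XQuery
            if len(child) == 0 and child.text is None:
                parent.remove(child)

root = ET.fromstring("<record><id>r1</id><source_code/><remark/></record>")
delete_empty_elements(root)
print(ET.tostring(root, encoding="unicode"))  # <record><id>r1</id></record>
```

For a database of this size, though, the thread's conclusion stands: deleting the nodes after parsing is expensive, and skipping empty cells at parse time (the option added in the snapshot) is the practical route.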