Re: [basex-talk] changes in BaseX-10.0 w.r.t. http:send-request() / fetch:doc() functions?

2022-08-05 Thread Ron Van den Branden

Hi,

Thanks for chiming in, Andy! I realized yesterday that I should have 
added that some URLs can be retrieved without problems in BaseX-10.0, e.g.:


  let $uri := 'https://www.w3.org'
  return http:send-request(<http:request method="get" status-only="true" href="{$uri}"/>)

...which is well-formed (to rule out non-XML parser issues) and indeed 
involves no redirection, which seems consistent with Andy's observation. 
Yet https://w3.org is also retrieved successfully, even though it 
initially responds with a 301 (rather than a 303).


Best,

Ron

On 4/08/2022 18:34, Andy Bunce wrote:
There seems to be a 303 redirect. Maybe this is relevant 
https://stackoverflow.com/a/66325588/3210344

/Andy



On Thu, 4 Aug 2022 at 16:19, Christian Grün wrote:


What I have assessed so far is that it’s the Java Client that fails to
retrieve the result. It’s the same response that’s returned by BaseX.

import java.net.URI;
import java.net.http.*;
import java.net.http.HttpResponse.BodyHandler;

String uri = "http://vocab.getty.edu/aat/300027473.rdf";
HttpClient client = HttpClient.newBuilder().build();
HttpRequest request = HttpRequest.newBuilder(URI.create(uri)).build();
BodyHandler<String> handler = HttpResponse.BodyHandlers.ofString();
HttpResponse<String> result = client.send(request, handler);
System.out.println(result.statusCode());
System.out.println(result.body());

400
Apache Tomcat/7.0.42 - Error report
HTTP Status 400 -
type: Status report
message:
description: The request sent by the client was syntactically incorrect.
Apache Tomcat/7.0.42

So we need to find out why the server thinks the Java request is
»syntactically incorrect«. Maybe we can compare the low-level
representation of the requests with Java 9 and 10 (?).


Re: [basex-talk] changes in BaseX-10.0 w.r.t. http:send-request() / fetch:doc() functions?

2022-08-05 Thread Ron Van den Branden

I'm stunned, thanks so much!

Best,

Ron

On 5/08/2022 11:05, Christian Grün wrote:
This is what we found out (with the help of Wireshark, and some online 
resources):


• The new JDK HTTP Client does not attach a default "Accept" header to 
the HTTP Request.
• The getty.edu web server (Tomcat?) returns a syntax error when this 
header is missing in the request.
• We also had a look at the 303 redirection. It works fine; in BaseX 
10, redirection has even been improved, as protocol changes (http → 
https) are now supported, too.


A new snapshot with a workaround is online [1,2].

Thanks for the observation.
Christian

[1] https://github.com/BaseXdb/basex/issues/2133
[2] https://files.basex.org/releases/latest/
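
The two findings above can be checked without touching the network. What follows is a minimal sketch, not the actual BaseX workaround: it assumes Java 11 or later, and the class name, the helper names, the "*/*" header value and the Redirect.NORMAL policy are illustrative assumptions. It shows that a request built with the JDK HTTP Client carries no "Accept" header unless one is set explicitly, and how a client can be asked to follow redirects (NORMAL follows them except from https to http).

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;

class AcceptHeaderSketch {
    // Builds the request the way the snippet in this thread does: no headers set.
    static HttpRequest plain(String uri) {
        return HttpRequest.newBuilder(URI.create(uri)).build();
    }

    // Hypothetical helper: the same GET request with an explicit Accept header.
    // The value "*/*" is an assumption, not necessarily what BaseX sends.
    static HttpRequest withAccept(String uri) {
        return HttpRequest.newBuilder(URI.create(uri))
            .header("Accept", "*/*")
            .GET()
            .build();
    }

    // Hypothetical helper: a client that follows redirects, including http -> https.
    static HttpClient redirectingClient() {
        return HttpClient.newBuilder()
            .followRedirects(HttpClient.Redirect.NORMAL)
            .build();
    }

    public static void main(String[] args) {
        String uri = "http://vocab.getty.edu/aat/300027473.rdf";
        // Nothing is sent here; only the locally built objects are inspected.
        System.out.println(plain(uri).headers().firstValue("Accept").isPresent());
        System.out.println(withAccept(uri).headers().firstValue("Accept").orElse(""));
        System.out.println(redirectingClient().followRedirects());
    }
}
```

To reproduce the 400 response against the real server, the request returned by plain() would have to be sent with client.send(...), as in the snippet quoted earlier in the thread.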





[basex-talk] changes in BaseX-10.0 w.r.t. http:send-request() / fetch:doc() functions?

2022-08-04 Thread Ron Van den Branden

Hi,

After upgrading BaseX to 10.0 (yay!), I notice that http:send-request() 
calls that used to work in 9.7 are now failing. I didn't find any 
changes documented between versions 9.7 and 10.0 at 
https://docs.basex.org/wiki/HTTP_Client_Module. Yet (tested with the 
same JDK), I'm observing differences between BaseX-9.7 and BaseX-10.0.


Test case:

  let $uri := 'http://vocab.getty.edu/aat/300027473.rdf'
  return http:send-request(<http:request method="get" status-only="true" href="{$uri}"/>)

Result:

 * BaseX-9.7: valid response

   <http:response xmlns:http="http://expath.org/ns/http-client" status="200" message="OK">
     ...
   </http:response>

 * BaseX-10.0: "Bad Request" error

   <http:response xmlns:http="http://expath.org/ns/http-client" status="400" message="Bad Request">
     ...
   </http:response>

The same results are obtained with fetch:xml() (BaseX-9.7 - valid 
response) and fetch:doc() (BaseX-10.0 - bad request).


Apologies if I'm overlooking the obvious, but has anything changed 
w.r.t. these http / fetch module functions or their underlying methods 
of network access that would require changes in my XQuery code or 
BaseX-10.0 configuration?


Best,

Ron


Re: [basex-talk] changes in BaseX-10.0 w.r.t. http:send-request() / fetch:doc() functions?

2022-08-04 Thread Ron Van den Branden

Dear Christian,

Whoops, the obvious, after all; thanks for kindly (and lightning fast) 
pointing that out, and looking into this!


Best,

Ron

On 4/08/2022 16:55, Christian Grün wrote:

Dear Ron,

There has indeed been a substantial change in the way
http:send-request works; it’s now based on the contemporary Java HTTP
Client API, which provides a better overall performance [1]. We’ll
additionally mention that in the article on the HTTP Client Module.

It should yield the same results as the old implementation, though; so
thanks for your example, we’ll see what we can do.

We’ll keep you updated.
Christian

[1] https://docs.basex.org/wiki/BaseX_10#HTTP_Requests




Re: [basex-talk] skipping empty cells when parsing CSV

2023-04-20 Thread Ron Van den Branden

Hi Christian,

As always, many thanks for your lightning-speed help!

The update command appears to be way out of my physical memory league, 
but I'm subscribed to the GitHub issue.


Best,

Ron

On 20/04/2023 14:28, Christian Grün wrote:

Hi Ron,

I agree that would be helpful. I’ve added a GitHub issue [1].

As you’ve already indicated, you can post-process your database
instances. I think the easiest query for that is:

   delete nodes db:get('db')//*[empty(node())]

…followed by an optional db:optimize('db').

Best,
Christian

[1] https://github.com/BaseXdb/basex/issues/2203





[basex-talk] skipping empty cells when parsing CSV

2023-04-20 Thread Ron Van den Branden

Hi all,

I'm investigating a way of analysing a massive set of > 900.000 CSV 
files, for which the CSV parsing in BaseX seems very useful, producing a 
db nicely filled with documents such as:



  
3a92-d10e-585e-84a7-29ad17c5799f
bbcy:vev:6860
AA
0


some remarks
en



  
  
3a92-d10e-585e-84a7-29ad17c5799f
bbcy:vev:6860
BE
0

concept





  

  


Yet, when querying those documents, I'm noticing that just selecting 
non-empty elements is very slow. For example:


  //source_code[normalize-space()]

...can take over 40 seconds.

Since I don't have control over the source data, it would be really 
great if empty cells could be skipped when parsing CSV files. Of course 
this could be a trivial post-processing step via XSLT / XQuery, but 
that's unfeasible for that mass of data.


Does BaseX provide a way of telling the CSV parser to skip empty cells?

Best,

Ron
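
To make the request concrete: the behaviour asked for here amounts to dropping a cell before an element is ever created for it. The sketch below is not BaseX code and does not use the BaseX CSV parser; the class, the method and all column names except source_code are hypothetical, and it assumes Java 11 or later and column names that are already valid XML element names.

```java
import java.util.List;

class SkipEmptyCellsSketch {
    // Hypothetical helper: turns one CSV record into an XML fragment and
    // omits the element for every cell that is empty or whitespace-only.
    static String toRecord(List<String> header, List<String> cells) {
        StringBuilder xml = new StringBuilder("<record>");
        for (int i = 0; i < header.size() && i < cells.size(); i++) {
            String value = cells.get(i);
            if (value.isBlank()) continue;   // the "skip empty cells" step
            String name = header.get(i);
            xml.append('<').append(name).append('>')
               .append(escape(value))
               .append("</").append(name).append('>');
        }
        return xml.append("</record>").toString();
    }

    // Escapes the characters that are not allowed in XML text content.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        // Column names "id" and "remark" are made up for this example.
        List<String> header = List.of("id", "source_code", "remark");
        List<String> cells = List.of("42", "", "some remarks");
        System.out.println(toRecord(header, cells));
    }
}
```

With empty cells left out at parse time, a query for non-empty source_code elements no longer has to test each element's content.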


Re: [basex-talk] skipping empty cells when parsing CSV

2023-04-26 Thread Ron Van den Branden

Hi Christian,

Fun it is, two birds with one stone! Both [1] and [2] seem to work, 
though on [2], node() still seems to be the slightly faster alternative 
(given that elements either contain at least one non-whitespace 
character or are empty, as is the case with my data). So thanks for 
making me aware of that performance gain anyway.


Many thanks!

Best,

Ron

On 26/04/2023 16:53, Christian Grün wrote:

Hi Ron,

The proposed option has been added to the latest snapshot [1,2].

In addition, we’ve optimized the evaluation of fn:normalize-space. If
it’s applied on element nodes, it will internally be rewritten to a
more efficient representation: E[normalize-space()] →
E[descendant::text()[normalize-space()]].

Have fun,
Christian

[1] https://files.basex.org/releases/latest/
[2] https://docs.basex.org/wiki/CSV_Module#Options

