Re: [basex-talk] default concatenation of strings? BaseX 10.5

2023-04-05 Thread Patrick Durusau

Liam,

Thanks!

Yes, in the response window and I didn't know it used "adaptive" 
serialization.


No promises but I hope to remember that!

Hope you are having a great week!

Patrick


On 4/5/23 16:25, Liam R. E. Quin wrote:

On Wed, 2023-04-05 at 16:01 -0400, Patrick Durusau wrote:

Greetings!

I'm converting Hebrew text, word by word, into code points, which is
returned as:

1493
1463
1497

etc

When you say returned as, I am guessing you mean that's what shows up
in the BaseX "results" window, which uses "adaptive" serialization.

You could use string-join(your query here, ' ') of course, to make a
single string;
in that window sequences are shown one item per line.


liam


--
Patrick Durusau
patr...@durusau.net
Technical Advisory Board, OASIS (TAB)
Editor, OpenDocument Format TC (OASIS), Project Editor ISO/IEC 26300
Co-Editor, ISO/IEC 13250-1, 13250-5 (Topic Maps)

Another Word For It (blog): http://tm.durusau.net
Homepage: http://www.durusau.net
Twitter: patrickDurusau





Re: [basex-talk] default concatenation of strings? BaseX 10.5

2023-04-05 Thread Liam R. E. Quin
On Wed, 2023-04-05 at 16:01 -0400, Patrick Durusau wrote:
> Greetings!
> 
> I'm converting Hebrew text, word by word, into code points, which is 
> returned as:
> 
> 1493
> 1463
> 1497

etc

When you say returned as, I am guessing you mean that's what shows up
in the BaseX "results" window, which uses "adaptive" serialization.

You could use string-join(your query here, ' ') of course, to make a
single string;
in that window sequences are shown one item per line.
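
As a minimal sketch of the string-join suggestion (the two sample words are rebuilt from code points quoted in the original question and are illustrative only), each word can be turned into one space-separated string:

```xquery
(: Sample words rebuilt from code points quoted in this thread; illustrative only. :)
let $words := (
  codepoints-to-string((1493, 1463, 1497)),
  codepoints-to-string((1500, 1464, 1489))
)
for $word in $words
(: One string per word, so each word occupies a single line in the result view. :)
return string-join(string-to-codepoints($word) ! string(.), ' ')
```

This should return the two strings "1493 1463 1497" and "1500 1464 1489". Returning ($char, " ") for each code point does not help, because the extra strings are just more items in the sequence, and each item is still shown on its own line.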


liam

-- 
Liam Quin, https://www.delightfulcomputing.com/
Available for XML/Document/Information Architecture/XSLT/
XSL/XQuery/Web/Text Processing/A11Y training, work & consulting.
Barefoot Web-slave, antique illustrations:  http://www.fromoldbooks.org


[basex-talk] default concatenation of strings? BaseX 10.5

2023-04-05 Thread Patrick Durusau

Greetings!

I'm converting Hebrew text, word by word, into code points, which is 
returned as:


1493
1463
1497
1468
1463
1513
1473
1456
1499
1468
1461
1448
1501

1500
1464
1489
1464
1436
1503

 (the file is quite long)

What I expect is described at: 
https://www.w3.org/TR/xslt-xquery-serialization/#sequence-normalization


"If the item-separator serialization parameter is absent, then for 
each subsequence of adjacent strings in S2, copy a single string to 
the new sequence equal to the values of the strings in the subsequence 
concatenated in order, each separated by a single space."


I may be very wrong, but shouldn't that render the strings as:

1493 1463 1497 1468 1463 1513 1473 1456 1499 1468 1461 1448 1501

and,

1500 1464 1489 1464 1436 1503

I've tried using replace($a, "\n", " ") but it complains that $a is a 
sequence, which it is.


Then I tried:

for $char in $a
return ($char, " ")

Now I get:

1493


1463


1497

etc.

I saw the new line settings under serialization but there didn't appear 
to be any way to defeat them altogether.
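
The rule quoted above is part of sequence normalization, which applies to output methods such as text and xml; the adaptive method serializes each item separately. A minimal sketch, assuming XQuery 3.1's fn:serialize with a parameter map (the sample code points are the first three from this message):

```xquery
let $codepoints := (1493, 1463, 1497)
return (
  (: No item-separator: adjacent atomic values are joined with single spaces. :)
  serialize($codepoints, map { 'method': 'text' }),
  (: Explicit item-separator: it is used instead of the space. :)
  serialize($codepoints, map { 'method': 'text', 'item-separator': ',' })
)
```

This should yield "1493 1463 1497" and "1493,1463,1497", which matches the behavior the specification describes for the text method.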


Thanks!

Patrick

--
Patrick Durusau
patr...@durusau.net
Technical Advisory Board, OASIS (TAB)
Editor, OpenDocument Format TC (OASIS), Project Editor ISO/IEC 26300
Co-Editor, ISO/IEC 13250-1, 13250-5 (Topic Maps)

Another Word For It (blog): http://tm.durusau.net
Homepage: http://www.durusau.net
Twitter: patrickDurusau





Re: [basex-talk] False Positives

2023-04-05 Thread Owen Ambur
Good to hear from you, Jorge.  Thanks not only for the pointer but also for all 
the work you did on your StratML prototype, as documented on GitHub.  It helped 
get us to this point and I'll look forward to any further contributions you may 
be able to make.
Naval, in the reference Jorge cites, here's the text that appears to be 
relevant:

By default, unless the languages codes ja, ar, ko, th, or zh are specified, a 
tokenizer for Western texts is used:
Whitespaces are interpreted as token delimiters.

The following is contrary to the intent of the StratML query service:

Since the logical flow of the text is not interrupted by the child elements, 
you will typically want to search across elements, so that the above paragraph 
would match a search for “real text”. For more examples, see XQuery and XPath 
Full Text 1.0 Use Cases.

The query service SHOULD respect each element as being distinct.  The purpose 
of the service is to enable discrete querying of the elements of the schema, 
and within each element whitespace should be treated as a delimiter.  However, 
this guidance is confusing:

To enable this kind of searches, it is recommendable to:
Keep whitespace stripping turned off when importing XML documents. This can be 
done by ensuring that STRIPWS is disabled. This can also be done in the GUI if 
a new database is created (Database → New… → Parsing → Strip Whitespaces).

The first two sentences seem to suggest that whitespaces will be maintained 
while the third indicates they would be removed.
While this may not be the most important next step to be taken to improve and 
enhance https://search.aboutthem.info/, it might be one of the easiest.
Owen Ambur https://www.linkedin.com/in/owenambur/
 

On Wednesday, April 5, 2023 at 03:28:57 AM EDT,  wrote:  
 
   

 
 
Hi Owen, 
 
 
You may want to check the full-text configuration capabilities at
https://docs.basex.org/wiki/Full-Text, such as positional filters and fuzzy
querying. It may be a bug, but I would rule out configuration first.
 
I can see that you are making good progress, and I love that you have chosen the
BaseX option. I think that you are on the right path.
 
 
Love to see the progress.
 
 
Kind regards. 
 
 

 
 

 On 08/03/2023 17:31, Owen Ambur wrote:
  
   Christian, do you know if this has been identified as a bug in BaseX's 
full-text query capability and, if so, if there are any plans to do anything 
about it? 
  If memory serves me correctly, I subscribed to the BaseX listserv for awhile 
to try to enlist a developer(s) for a StratML-enabled query service, like the 
one on which Naval is now working for me for hosting at 
https://aboutthem.info/. 
  When the query service is in relatively good shape, I may wish to resubscribe 
to the listserv to announce it there as well as on LinkedIn and perhaps 
elsewhere.  However, do you think it might be worthwhile to raise this issue on 
the listserv in the meantime? 
Owen Ambur https://www.linkedin.com/in/owenambur/
  
  
  On Tuesday, March 7, 2023 at 03:16:49 PM EST, Naval Sarda 
 wrote:  
  
 
Hi Owen.
 
The built-in search provided by BaseX combines the text from the next file and
then searches.
 
So if a line ends with the word "end." and the next line starts with "less", it will
match the search criterion "endless".
 
This is false-positive matching. There is not much we can do about it, as
replacing it with a custom search would be slow.
 
Naval
 
  On 07/03/23 6:38 am, Owen Ambur wrote:
  
 
   What can we do about it? 
Owen Ambur https://www.linkedin.com/in/owenambur/
  
  
  On Monday, March 6, 2023 at 07:16:17 PM EST, Naval Sarda 
 wrote:  
  
 

 
 Please see below
 
-------- Forwarded Message --------
Subject: Re: Fwd: False Positives
Date: Mon, 6 Mar 2023 21:38:43 +0530
From: Sudarshana
To: Naval Sarda, jitend...@epicomm.net

 
 
 
Owen,

This is a known issue that we had informed you about.

In full-text search, if any whitespace character (tab, space, or newline) is
present, the match still appears in the results.

In the file APQC.xml, Board of Governors of the Federal Reserve System is one
organization and Bombardier Aerospace Inc. is the next, adjacent organization.

So in "Board of Governors of the Federal Reserve System Bombardier Aerospace Inc."
the highlighted keyword is taken to be "tembom" (the end of "System" followed by
the start of "Bombardier").

That is why those files appear in the results.
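
The mechanism described here can be sketched with plain string functions. This is only an illustration of how text from adjacent elements can run together; it uses fn:contains rather than the BaseX full-text engine, and the element names are hypothetical, not taken from APQC.xml:

```xquery
(: Hypothetical fragment: two adjacent organization names with no whitespace between the elements. :)
let $orgs :=
  <Organizations><Name>Federal Reserve System</Name><Name>Bombardier Aerospace Inc.</Name></Organizations>
return (
  (: String value of the parent joins both names: "...SystemBombardier...". :)
  contains(lower-case(string($orgs)), 'tembo'),
  (: Checking each Name element separately finds no match. :)
  some $name in $orgs/Name satisfies contains(lower-case($name), 'tembo')
)
```

The first test is true and the second is false, which is the difference between searching across elements and treating each element as distinct.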
 
-Sudarshana
 On 3/6/2023 10:18 AM, Naval Sarda wrote:
  
  
   
From: Owen Ambur
Sent: Monday, March 6, 2023 6:35 AM
To: Naval Sarda
Cc: abouttheminfop...@googlegroups.com
Subject: False Positives

Naval, Ken Holman's LinkedIn posting about his health issue prompted me to
query to confirm that Project TEMBO's about statement is in the StratML
collection.

However, the full-text query also revealed a couple apparently false
positives, as shown in the screen shot below.  They are:

https://stratml.us/docs/APQC.xml
https://stratml.us/docs/DOSAID2022.x

Re: [basex-talk] Constructing "resolved" DITA Map in XQuery: How to Avoid High Memory usage?

2023-04-05 Thread Eliot Kimber
I don’t think it’s multiple inclusion of the same resource, although that is a 
possibility with our content (although one I’ve worked hard to eliminate in the 
latest updates to our content set).

At least based on logging shown in the GUI, the failure happens long before the 
possibility of encountering a possibly multiply-included submap, for example 
(which would be the case that might result in a loop).

To answer Liam’s question about just storing the resolved map, the challenge is 
that the resolved map needs to reflect the node IDs of the original elements, 
so pregenerating it won’t work without some way to then correlate the resolved 
map elements to their corresponding elements in the original source.

As I’ve thought about it more I think a process that walks the map tree and 
constructs XQuery maps is the best solution.
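
A minimal sketch of what such a walk might produce, assuming a much simplified model in which the first definition of a key wins and key scopes are ignored (the sample map and its attribute values are hypothetical):

```xquery
(: Hypothetical miniature DITA map; real maps would come from the database. :)
let $map :=
  <map>
    <topicref keys="intro" href="intro.dita"/>
    <topicref keys="install setup" href="install.dita"/>
    <topicref keys="intro" href="other-intro.dita"/>
  </map>
(: One map entry per key name; 'use-first' keeps the first topicref that defines a key. :)
let $key-index := map:merge(
  for $ref in $map//topicref[@keys]
  for $key in tokenize($ref/@keys)
  return map:entry($key, $ref),
  map { 'duplicates': 'use-first' }
)
return $key-index?intro/@href/string()
```

This returns "intro.dita". For nodes stored in a database, the entry value could be db:node-id($ref) instead of the node itself, so that only integers are held in memory; that variation is an assumption and is not shown here.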

Cheers,

E.

_
Eliot Kimber
Sr Staff Content Engineer
O: 512 554 9368
M: 512 554 9368
servicenow.com
LinkedIn | Twitter | YouTube | Facebook

From: Hans-Juergen Rennau 
Date: Wednesday, April 5, 2023 at 2:30 AM
To: basex-talk@mailman.uni-konstanz.de , 
Eliot Kimber 
Subject: Re: [basex-talk] Constructing "resolved" DITA Map in XQuery: How to 
Avoid High Memory usage?


Greetings, Eliot,

could it be that the problem arises from repeated inclusion of one and the same 
resource, which is referenced by different resources? You might check this by 
determining the cumulative size of the resources to be potentially included. Is 
it really >1 GB?

Even if you use a recursive function receiving as a parameter the resources 
already processed and suppress the processing of a resource found among them, 
like so

declare function f:resolve($node, $alreadyFound) {
    if ($node intersect $alreadyFound) then () else
    ...
    ... f:resolve($child, ($node, $alreadyFound))
    ...
};

this does prevent circular inclusion, but may not be sufficient to prevent a 
combinatorial explosion. The explosion may occur if you process siblings in a 
straightforward way, so that the result of resolving one element is not fed 
into the processing of the following siblings, like so:

declare function f:resolve($node, $alreadyFound) {
    ...
    ... $node/*/f:resolve(., ($node, $alreadyFound))
    ...
};

To avoid combinatorial explosion I suggest a method which I call "total 
recursion", in which each invocation of the recursive function processes only 
one node, traversing siblings recursively. (If relevant, details on demand.)

Kind regards,
Hans-Jürgen

On Wednesday, April 5, 2023 at 01:13:10 CEST, Eliot Kimber wrote:



I’m implementing a feature of DITA which involves pulling together all the DITA 
maps and submaps linked from a root map so that you can then process them as a 
single unit in order to then construct “key spaces”, which are defined by the 
topicrefs contained in the maps and which depend on both the structural 
hierarchy defined by the tree of maps and submaps and on the markup details of 
both the maps and the topicref elements. It’s a challenging bit of data 
processing.



In other contexts where I’ve implemented this processing I start by creating a 
“resolved map” using a relatively simple transform, resulting in a single XML 
document with all the stuff needed to then construct the DITA key space. With 
the resolved map, the logic to construct the key space is a relatively simple 
three-phase process.



My naïve attempt to do this in BaseX using the normal typeswitch approach to 
implement an identity transform worked at a small scale, but for our real DITA 
maps, which have 10s of 1000s of elements, the process quickly exhausts the 2GB 
of RAM allocated to BaseX GUI.



The reason I’m doing the transform in XQuery and not just using Saxon via the 
XSLT module is because I need to annotate the resulting resolved map with the 
database node IDs of each element so I can then capture those details in the 
final key space, which I’m storing as XML in another database—the constructed 
key space acts as an index where the input is a context element/key name pair 
and the result is the topicref element that defines the key, from which I can 
then get the resource associated with that topicref (i.e., the topic it 
references or a string it defines or whatever it might be).



So my first question is: Is there a general technique for doing this kind of 
identity transform that won’t blow up the memory? I suspect the answer is “no” 
but figured I’d ask.



Or is it possible to apply a Saxon transform to content pulled from the BaseX 
database and have access to the node IDs? I didn’t immediately see a way that 
you could do that.  There must be a pretty sharp separation between BaseX and 

Re: [basex-talk] XQuery inconsistent result related to overflow

2023-04-05 Thread Christian Grün
Hi Shuxin,

Thanks for the new test case. I think you got it right, it looks like
a corner case, which is caused by an internal optimization. It can be
explained via the following expression, which yields "false" and
"true" (while one might expect "false" and "false"):

0 * 3780298429396748056 = 1,
0 = 1 div 3780298429396748056

As 1 div 3780298429396748056 is defined to return the decimal value 0,
the result will be changed if the right-hand operand of the
multiplication is moved to the right-hand side of the comparison –
which is precisely what our optimizer does when compiling your test
expression:

//A1[((count(./*) idiv 3) * -1763118392 * 2144097893) != 1]

We decided to do so because it works out fine in all practical use
cases we encountered so far.
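
The effect of the rewrite can be sketched by writing both forms by hand. This only illustrates the arithmetic described above, not the optimizer's actual code; $n stands in for (count(./*) idiv 3), and a processor may still rewrite or pre-evaluate either form:

```xquery
(: $n stands in for (count(./*) idiv 3), which is 0 for an element with fewer than 3 children. :)
let $n := 0
let $c := -1763118392 * 2144097893
return (
  (: Original form: 0 multiplied by any integer is 0, and 0 != 1 holds. :)
  $n * $c != 1,
  (: Rewritten form: the outcome depends on how 1 div $c is rounded as xs:decimal. :)
  $n != 1 div $c
)
```

If 1 div $c is rounded to the decimal 0, as described above, the second comparison becomes 0 != 0 and the predicate no longer selects the node.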

But thanks anyway for your feedback! Maybe we can refine our processor
to suppress optimizations for which we can statically detect that the
result would change.

Best regards, and looking forward to your next reports,
Christian


[basex-talk] XQuery inconsistent result related to overflow

2023-04-05 Thread Shuxin Li
Hi,

I'm Shuxin. Recently I came across this test case in which BaseX returned
an incorrect result. I cannot yet confirm it as a bug, and it may only be
a corner case that is not important, but there is some interesting
behavior that I'm also curious about.

Given the XML document



and XPath query

//A1[((count(./*) idiv 3) * -1763118392 * 2144097893) != 1]

BaseX returns an empty result set, while node A1 should be returned. This should
be related to overflow, since making either number smaller causes BaseX to
return the correct result, and it could therefore be considered not a bug. But
interestingly, this only occurs when the operand given after idiv (in this
case 3) is greater than 2, so that the result is 0. If it is changed to (count(./*)
idiv 2) or (count(./*) idiv 1), the operand is larger but correct results
are still returned. Simply substituting the whole (count(./*) idiv 3) with the
constant 0, 1, or 2 also returns correct results. This indicates that
BaseX should be able to handle this expression, but is somehow
affected. The version of BaseX I reproduced this on is the latest development
commit, e607ecc.

Still, this is just a corner case which might not require a fix. I submit
this report just for your reference. Thank you very much!

Best Regards,
Shuxin Li
2023.4.5


Re: [basex-talk] Constructing "resolved" DITA Map in XQuery: How to Avoid High Memory usage?

2023-04-05 Thread Hans-Juergen Rennau
Greetings, Eliot,

could it be that the problem arises from repeated inclusion of one and the same 
resource, which is referenced by different resources? You might check this by 
determining the cumulative size of the resources to be potentially included. Is 
it really >1 GB?

Even if you use a recursive function receiving as a parameter the resources 
already processed and suppress the processing of a resource found among them, 
like so

declare function f:resolve($node, $alreadyFound) {
    if ($node intersect $alreadyFound) then () else
    ...
    ... f:resolve($child, ($node, $alreadyFound))
    ...
};

this does prevent circular inclusion, but may not be sufficient to prevent a 
combinatorial explosion. The explosion may occur if you process siblings in a 
straightforward way, so that the result of resolving one element is not fed 
into the processing of the following siblings, like so:

declare function f:resolve($node, $alreadyFound) {
    ...
    ... $node/*/f:resolve(., ($node, $alreadyFound))
    ...
};

To avoid combinatorial explosion I suggest a method which I call "total 
recursion", in which each invocation of the recursive function processes only 
one node, traversing siblings recursively. (If relevant, details on demand.)
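
One possible reading of this idea, as a minimal sketch and not necessarily the method the author has in mind: each call handles only the first pending resource, and the set of resources already seen is passed on to the remaining siblings. The resource model (map elements with @id and ref/@href), the namespace URI, and all names are hypothetical:

```xquery
declare namespace f = "urn:example:resolve";  (: hypothetical namespace :)

(: Hypothetical resources: 'shared' is referenced from both 'a' and 'b'. :)
declare variable $resources :=
  <resources>
    <map id="root"><ref href="a"/><ref href="b"/></map>
    <map id="a"><ref href="shared"/></map>
    <map id="b"><ref href="shared"/></map>
    <map id="shared"/>
  </resources>;

(: Handles only the first pending id per call; returns all ids visited so far. :)
declare function f:resolve($ids as xs:string*, $seen as xs:string*) as xs:string* {
  if (empty($ids)) then $seen
  else
    let $id := head($ids)
    let $seen-after-head :=
      if ($id = $seen) then $seen
      else f:resolve($resources/map[@id = $id]/ref/@href ! string(.), ($seen, $id))
    (: The siblings receive everything found while resolving the head. :)
    return f:resolve(tail($ids), $seen-after-head)
};

f:resolve('root', ())
```

This returns ('root', 'a', 'shared', 'b'): the shared resource is resolved once, because what was found under 'a' is already known when 'b' is processed.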

Kind regards,
Hans-Jürgen

On Wednesday, April 5, 2023 at 01:13:10 CEST, Eliot Kimber wrote:
 
  
I’m implementing a feature of DITA which involves pulling together all the DITA 
maps and submaps linked from a root map so that you can then process them as a 
single unit in order to then construct “key spaces”, which are defined by the 
topicrefs contained in the maps and which depend on both the structural 
hierarchy defined by the tree of maps and submaps and on the markup details of 
both the maps and the topicref elements. It’s a challenging bit of data 
processing.
 
  
 
In other contexts where I’ve implemented this processing I start by creating a 
“resolved map” using a relatively simple transform, resulting in a single XML 
document with all the stuff needed to then construct the DITA key space. With 
the resolved map, the logic to construct the key space is a relatively simple 
three-phase process.
 
  
 
My naïve attempt to do this in BaseX using the normal typeswitch approach to 
implement an identity transform worked at a small scale, but for our real DITA 
maps, which have 10s of 1000s of elements, the process quickly exhausts the 2GB 
of RAM allocated to BaseX GUI. 
 
  
 
The reason I’m doing the transform in XQuery and not just using Saxon via the 
XSLT module is because I need to annotate the resulting resolved map with the 
database node IDs of each element so I can then capture those details in the 
final key space, which I’m storing as XML in another database—the constructed 
key space acts as an index where the input is a context element/key name pair 
and the result is the topicref element that defines the key, from which I can 
then get the resource associated with that topicref (i.e., the topic it 
references or a string it defines or whatever it might be).
 
  
 
So my first question is: Is there a general technique for doing this kind of 
identity transform that won’t blow up the memory? I suspect the answer is “no” 
but figured I’d ask.
 
  
 
Or is it possible to apply a Saxon transform to content pulled from the BaseX 
database and have access to the node IDs? I didn’t immediately see a way that 
you could do that.  There must be a pretty sharp separation between BaseX and 
Saxon here, but again, maybe I missed something?
 
  
 
If the answer is “no” I can work out a more sophisticated way to build the 
initial data from which the key space is ultimately constructed by walking the 
map tree and populating XQuery maps or something, but I was hoping to keep my 
simple code that just operates on the resolved map.
 
  
 
Thanks,
 
  
 
Eliot
 
_
Eliot Kimber
Sr Staff Content Engineer
O: 512 554 9368
M: 512 554 9368
servicenow.com
LinkedIn | Twitter | YouTube | Facebook