Good to hear from you, Jorge.  Thanks not only for the pointer but also for all 
the work you did on your StratML prototype, as documented on GitHub.  It helped 
get us to this point, and I look forward to any further contributions you may 
be able to make.

Naval, in the reference Jorge cites, here is the text that appears to be 
relevant:

By default, unless the languages codes ja, ar, ko, th, or zh are specified, a 
tokenizer for Western texts is used:
Whitespaces are interpreted as token delimiters.

The following is contrary to the intent of the StratML query service:

Since the logical flow of the text is not interrupted by the child elements, 
you will typically want to search across elements, so that the above paragraph 
would match a search for “real text”. For more examples, see XQuery and XPath 
Full Text 1.0 Use Cases.

The query service SHOULD respect each element as being distinct.  The purpose 
of the service is to enable discrete querying of the elements of the schema, 
and within each element whitespace should be treated as a delimiter.  However, 
this guidance is confusing:

To enable this kind of searches, it is recommendable to:
Keep whitespace stripping turned off when importing XML documents. This can be 
done by ensuring that STRIPWS is disabled. This can also be done in the GUI if 
a new database is created (Database → New… → Parsing → Strip Whitespaces).

The first two sentences seem to suggest that whitespaces will be maintained 
while the third indicates they would be removed.
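
The intended behavior described above, with each element treated as distinct and whitespace treated as a token delimiter inside it, might be sketched roughly as follows. This is only an illustration in Python, not BaseX or XQuery code: the XML fragment, the element names, and the helper functions `tokens` and `matching_elements` are hypothetical and are not taken from any StratML file or from the query service itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment (assumption: not the real content of APQC.xml).
SAMPLE = """<Stakeholders>
<Organization><Name>Board of Governors of the Federal Reserve System</Name></Organization>
<Organization><Name>Bombardier Aerospace Inc.</Name></Organization>
</Stakeholders>"""


def tokens(text):
    """Split on whitespace, lowercase, and trim surrounding punctuation."""
    return [t.strip(".,;:!?\"'()").lower() for t in text.split()]


def matching_elements(root, term):
    """Return the text of each leaf element that contains `term` as a whole token.

    Each element is searched on its own, so a match can never span two elements.
    """
    term = term.lower()
    hits = []
    for elem in root.iter():
        own_text = elem.text or ""
        if len(elem) == 0 and term in tokens(own_text):
            hits.append(own_text)
    return hits


root = ET.fromstring(SAMPLE)
print(matching_elements(root, "tembo"))       # no single element contains this token
print(matching_elements(root, "bombardier"))  # found within one element only
```

Searching element by element in this way is slower than a prebuilt full-text index, which is the trade-off Naval mentions in his message further down the thread.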
While this may not be the most important next step to be taken to improve and 
enhance https://search.aboutthem.info/, it might be one of the easiest.
Owen Ambur https://www.linkedin.com/in/owenambur/
 

    On Wednesday, April 5, 2023 at 03:28:57 AM EDT, <jo...@vionta.net> wrote:  
 
Hi Owen,

You may want to check the full-text configuration capabilities at
https://docs.basex.org/wiki/Full-Text, such as positional filters and fuzzy
querying. It may be a bug, but I would rule out configuration first.

I can see that you are making good progress, and I love that you have chosen
the BaseX option. I think that you are on the right path.

Love to see the progress.

Kind regards.

 On 08/03/2023 17:31, Owen Ambur wrote:
  
Christian, do you know if this has been identified as a bug in BaseX's 
full-text query capability and, if so, whether there are any plans to do 
anything about it? 

If memory serves me correctly, I subscribed to the BaseX listserv for a while 
to try to enlist one or more developers for a StratML-enabled query service, 
like the one Naval is now working on for me, to be hosted at 
https://aboutthem.info/. 

When the query service is in relatively good shape, I may wish to resubscribe 
to the listserv to announce it there as well as on LinkedIn and perhaps 
elsewhere.  However, do you think it might be worthwhile to raise this issue on 
the listserv in the meantime? 

Owen Ambur https://www.linkedin.com/in/owenambur/
      
  
      On Tuesday, March 7, 2023 at 03:16:49 PM EST, Naval Sarda 
<nsa...@epicomm.net> wrote:  
  
     
Hi Owen.

The built-in search provided by BaseX combines the text with that of the next 
line and then searches.

So if a line ends with the word "end." and the next line starts with "less", it 
will match the search criterion "endless".

This is a false-positive match. There is not much we can do about it, as 
replacing the built-in search with a custom search would be slow.
 
Naval
 
  On 07/03/23 6:38 am, Owen Ambur wrote:
  
 
       What can we do about it? 
    Owen Ambur https://www.linkedin.com/in/owenambur/
      
  
      On Monday, March 6, 2023 at 07:16:17 PM EST, Naval Sarda 
<nsa...@epicomm.net> wrote:  
  
     

 
 Please see below
 
 -------- Forwarded Message -------- 
Subject: Re: Fwd: False Positives
Date: Mon, 6 Mar 2023 21:38:43 +0530
From: Sudarshana <sudarsha...@epicomm.net>
To: Naval Sarda <nsa...@epicomm.net>, jitend...@epicomm.net

 
 
 
Owen,

This is a known issue that we had informed you about.

In full-text search, a match is still returned even when a whitespace character 
(tab, space, or newline) falls inside it.

In the file APQC.xml, Board of Governors of the Federal Reserve System is one 
organization and Bombardier Aerospace Inc. is the adjacent organization that 
follows it.

So in "Board of Governors of the Federal Reserve System Bombardier Aerospace 
Inc.", the highlighted keyword is taken to be "tembom", spanning the end of 
"System" and the start of "Bombardier".

That is why those files appear in the results.
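
The effect described above can be reproduced outside BaseX with a small Python sketch. It assumes, as the explanation in this message suggests, that the text of adjacent elements ends up joined with no separating whitespace. The fragment and its element names are hypothetical, and nothing here is a confirmed description of how BaseX works internally.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment (assumption): two adjacent elements, no whitespace between them.
FRAGMENT = (
    "<Stakeholders>"
    "<Name>Board of Governors of the Federal Reserve System</Name>"
    "<Name>Bombardier Aerospace Inc.</Name>"
    "</Stakeholders>"
)

root = ET.fromstring(FRAGMENT)

# Joining the text nodes directly fuses "System" and "Bombardier" into one run.
joined = "".join(root.itertext())

# Keeping a delimiter between elements preserves the boundary.
spaced = " ".join(name.text for name in root.iter("Name"))

print("tembo" in joined.lower())  # the query term appears across the element boundary
print("tembo" in spaced.lower())  # the boundary blocks the spurious match
```

In the joined string, the end of "System" and the start of "Bombardier" together contain "tembo", which would account for the "tembom" highlight reported above.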
 
-Sudarshana
 On 3/6/2023 10:18 AM, Naval Sarda wrote:
  
      
   
From: Owen Ambur <owen.am...@verizon.net>
Sent: Monday, March 6, 2023 6:35 AM
To: Naval Sarda <nsa...@epicomm.net>
Cc: abouttheminfop...@googlegroups.com <abouttheminfop...@googlegroups.com>
Subject: False Positives

Naval, Ken Holman's LinkedIn posting about his health issue prompted me to 
query to confirm that Project TEMBO's about statement is in the StratML 
collection. 

However, the full-text query also revealed a couple of apparent false 
positives, as shown in the screen shot below.  They are:

https://stratml.us/docs/APQC.xml
https://stratml.us/docs/DOSAID2022.xml

Since the latter is tangentially related in terms of foreign aid, it might be 
logical for an AI-enhanced query service to reveal it as such.  However, ours 
isn't that "intelligent," is it? 

What do you suppose might account for the false positives?  This isn't the 
first time I've encountered them. 

Owen Ambur https://www.linkedin.com/in/owenambur/
  
  
    
        
 -- 
 Thanks & Regards
 Sudarshana             