Thank you, Sebastian!

Yes, BaseX is an incredible piece of software, reducing development time by 
orders of magnitude.

The problems I faced were elsewhere:

  *   disruptive technology: pure XML technology is not widely known among IT 
people; few engineers have even a beginner's level of XPath, XSLT, or XQuery. I 
found that most colleagues did not want to improve their skills in that domain, 
sticking with java/jaxb/sax/sql (and sometimes even hibernate...) and finding 
all kinds of reasons not to embrace this solution.
  *   not hype (not 'big' data): except for MarkLogic (and even then in an 
ashamed fashion, in my opinion), XML is sadly absent from the 'big' data 
landscape, even though we did not wait for big data tools (map/reduce, json...) 
to handle lots of data!
  *   management, by the way, was reluctant to give that solution a try...
  *   who's that guy who does the entire team's job in one week with a solution 
no one else can maintain?

So yes, I fell in love with BaseX and XML too, but even though it gave me great 
pleasure, it was (and still is) a kind of secret love, a team and management 
breaker.
I certainly have my part in that situation, with my visceral aversion to things 
like governance, mediocracy...

All the best from the French west coast,
Fabrice Etanchaud

________________________________
From: Sebastian Guerrero <chap...@gmail.com>
Sent: Monday, May 18, 2020 20:32
To: ETANCHAUD Fabrice <fabrice.etanch...@maif.fr>
Cc: basex-talk@mailman.uni-konstanz.de <basex-talk@mailman.uni-konstanz.de>
Subject: Re: [basex-talk] Full-text index: searches for common words in another 
node. Does it take a lot of time?

Hi Fabrice!

Thanks a lot for your advice. Yes, it's a good idea. And yes, it works.

I created a separate index (a new database) for 'mark-identification':

(: Build one small "index" database per source database. Each distinct
   mark-identification text is stored once, together with the node ids of
   all case-file elements that carry it. :)
for $db in ('US00','US01','US02')
let $index := <index>{
  for $cases in db:open($db)/trademark-applications-daily
    /application-information/file-segments/action-keys/case-file
  group by $text := $cases/case-file-header/mark-identification
  return
  <text>
    <value>{$text}</value>
    <nodes>
        {for $node in $cases return <id>{ db:node-id($node) }</id>}
    </nodes>
  </text>
}</index>
return db:create($db || '-mark-text', $index, $db || '-mark-text.xml')

Of course with a full-text index for 'value'.
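
For reference, one way to request that index is sketched below. It assumes 
BaseX 9.x, where db:optimize accepts an options map and the FTINDEX and 
FTINCLUDE database options can be passed in lower case. The '-mark-text' 
database names are the ones created above; restricting the index to 'value' 
through FTINCLUDE is an assumption about how the index was set up, since the 
exact options are not given in this thread.

```xquery
(: Sketch only: rebuild each '-mark-text' database with a full-text index
   restricted to the <value> elements. The option names are assumed to
   match the FTINDEX / FTINCLUDE database options of BaseX 9.x; check them
   against the documentation of the version in use. :)
for $db in ('US00', 'US01', 'US02')
let $name := $db || '-mark-text'
return db:optimize($name, true(), map {
  'ftindex': true(),      (: create the full-text index :)
  'ftinclude': 'value'    (: index only text inside <value> elements :)
})
```

Run once after the databases exist, this should leave each of them with a 
full-text index covering only the mark texts, which is what the ft:search call 
in the query below relies on.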

So, to search I use this piece of code:

  let $text := 'corporation'
  for $db in ('US00','US01','US02')
  for $id in ft:search($db || '-mark-text', $text)/ancestor::text/nodes/id
  let $case-file := db:open-id($db, $id)
  return $case-file

Now it takes only 185 ms to get the results, and there is no scan of the 
'party-name' values.

- "I really appreciated working with BaseX at that time, because others were in 
a kind of java/relational mapping hell... Me, I just had to add xml documents, 
reindex, and sometimes purge deleted items.": Oh dear, I can't tell you how 
much I'm in love with BaseX right now. Yes, trying to manage this volume of 
data and translate it into a SQL database is a Kafkaesque nightmare, not a 
healthy idea.

Thank you very much!
Cheers,
Sebastian.

On Mon, May 18, 2020 at 12:43 PM ETANCHAUD Fabrice <fabrice.etanch...@maif.fr> 
wrote:
Hi Sebastian,
Yes, I think your search on mark-identification suffers from the huge number of 
party-names.
From what I remember, the reverse index (from full-text tokens to node ids) is 
shared across all element names, so filtering on the element name is done 
last.

When I was using BaseX to handle the DOCDB patent db, I used to split each 
document into sub-documents containing only the keys and the text to be 
indexed, one per language and xml element, and then build separate databases.
That way I could create a dedicated full-text index for a single (element name, 
language) combination.

Does that help?

I really appreciated working with BaseX at that time, because others were in a 
kind of java/relational mapping hell... Me, I just had to add xml documents, 
reindex, and sometimes purge deleted items.

Best,
Fabrice

________________________________
From: BaseX-Talk <basex-talk-boun...@mailman.uni-konstanz.de> on behalf of 
Sebastian Guerrero <chap...@gmail.com>
Sent: Monday, May 18, 2020 17:23
To: BaseX <basex-talk@mailman.uni-konstanz.de>
Subject: [basex-talk] Full-text index: searches for common words in another 
node. Does it take a lot of time?

Hi everybody.

I'm here again with my questions. Thank you for your patience. ^^

I have a database of trademarks with a full-text index on two elements: 
*:mark-identification,*:party-name. [1]

"mark-identification" contains the name of the trademark, and "party-name" 
contains the name of the owner of the trademark.

I use the full-text index to search for trademarks by name, for example:

for $results in //case-file[case-file-header/mark-identification/text() 
contains text {'basex'}]
return $results//mark-identification

This returns all trademarks with "basex" in their name. It works like 
lightning: 15 ms to get 3 records among 2,134,434,598 nodes. Really a dream. [2]

But if I change the search text from "basex" to a word that is common in 
"party-name", for example "corporation" (it has 1,096,187 occurrences in the 
full-text index, as shown in [1]; it is a very common word among trademark 
owners):

for $results in //case-file[case-file-header/mark-identification/text() 
contains text {'corporation'}]
return $results//mark-identification

It takes a long time to get 6,715 records: 62,000 ms [3]

If I search for "live" (a common word in trademark names, but not in owner 
names) I get 5,875 records in 2,773 ms, which bears no relation to the 
62,000 ms needed to get the 6k records for "corporation". [4]

So...

  *   Is this expected behaviour?
  *   Is there a way to specify which "section" of the full-text index should 
be used to perform the search? (I don't know... maybe something similar to 
"using stemming", but "using index 'mark-identification'")

I apologize if I'm asking something illogical.

Best regards,
Sebastian

[1] https://imgur.com/uLla1Xt
[2] https://imgur.com/Fkcvv2O
[3] https://imgur.com/Hk71CNe
[4] https://imgur.com/P72k574
