Hi Fabrice!

Thanks a lot for your advice. Yes, it's a good idea. And yes, it works.

I created a separate index (a new database) for '*mark-identification*':

  for $db in ('US00','US01','US02')
  let $index := <index>{
    for $cases in
      db:open($db)/trademark-applications-daily/application-information/file-segments/action-keys/case-file
    group by $text := $cases/case-file-header/mark-identification
    return
    <text>
      <value>{$text}</value>
      <nodes>
        {for $node in $cases return <id>{ db:node-id($node) }</id>}
      </nodes>
    </text>
  }</index>
  return db:create($db || '-mark-text', $index, $db || '-mark-text.xml')


Of course with a full-text index for '*value*'.
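
For reference, here is a minimal sketch of how the full-text index could be restricted to '*value*' when the database is created. It assumes that db:create accepts an options map as its fourth argument and that the 'ftindex' and 'ftinclude' keys correspond to the FTINDEX and FTINCLUDE database options; the tiny $index element is only a stand-in for the one built above, and this is not necessarily how the index was actually configured:

```xquery
(: Sketch only. Assumptions: the 'ftindex' and 'ftinclude' option keys,
   and a placeholder $index instead of the real grouped index. :)
let $db := 'US00'
let $index :=
  <index>
    <text>
      <value>example mark</value>
      <nodes><id>1</id></nodes>
    </text>
  </index>
return db:create(
  $db || '-mark-text',      (: name of the derived database :)
  $index,                   (: document to store :)
  $db || '-mark-text.xml',  (: path of the document inside the database :)
  map {
    'ftindex': true(),      (: build a full-text index :)
    'ftinclude': 'value'    (: index only the text of <value> elements :)
  }
)
```

With the index limited to <value>, the derived database contains no '*party-name*' tokens at all, so they cannot slow down the lookup.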

So, to search I use this piece of code:

  let $text := 'corporation'
  for $db in ('US00','US01','US02')
  for $id in ft:search($db || '-mark-text', $text)/ancestor::text/nodes/id
  let $case-file := db:open-id($db, $id)
  return $case-file


And now it takes only 185 ms to get the results, and the '*party-name*'
values are no longer scanned.

*"I really appreciated working with basex that time, because others
were in a kind of java/relational mapping hell... Me, I just had to add xml
documents, reindex, and sometimes purge deleted items."*

Oh dear, I can't explain how much I'm in love with BaseX right now. Yes,
trying to manage this volume of data and translate it into an SQL database
would be a Kafkaesque nightmare, not a healthy idea.

Thank you very much!
Cheers,
Sebastian.

On Mon, May 18, 2020 at 12:43 PM ETANCHAUD Fabrice <
fabrice.etanch...@maif.fr> wrote:

> Hi Sebastian,
> Yes, I think your search on mark-identification suffers from the huge
> number of party-names.
> From what I remember, the inverted index (from full-text tokens to node
> ids) is shared across all element names, so filtering on the element
> name is done last.
>
> When I was using BaseX to handle the DOCDB patent db, I used to split a
> document into sub-documents containing only the keys and the text to be
> indexed, one per language and xml element, and then build separate
> databases. That way I could create a dedicated full-text index on a single
> (element name, language) combination.
>
> Does that help?
>
> I really appreciated working with basex that time, because others were in
> a kind of java/relational mapping hell... Me, I just had to add xml
> documents, reindex, and sometimes purge deleted items.
>
> Best,
> Fabrice
>
> ------------------------------
> *From:* BaseX-Talk <basex-talk-boun...@mailman.uni-konstanz.de> on behalf
> of Sebastian Guerrero <chap...@gmail.com>
> *Sent:* Monday, May 18, 2020 17:23
> *To:* BaseX <basex-talk@mailman.uni-konstanz.de>
> *Subject:* [basex-talk] Full-text index: searches for common words in
> another node. Does it take a lot of time?
>
> Hi everybody.
>
> I'm back with more questions. Thank you for your patience. ^^
>
> I have a database of trademarks with a full-text index for two nodes:
> *:mark-identification,*:party-name. [1]
>
> Where "*mark-identification*" contains the name of the trademark, and "
> *party-name*" contains the name of the owner of the trademark.
>
> I use the full-text index to search for trademarks by name, for example:
>
>   for $results in //case-file[case-file-header/mark-identification/text()
>     contains text {'basex'}]
>   return $results//mark-identification
>
>
> returns all trademarks with "*basex*" in their names. It is lightning
> fast: 15ms to get 3 records among 2,134,434,598 nodes. Really a
> dream. [2]
>
> But if I change the search text from "*basex*" to a word that is common
> in "*party-name*", for example "*corporation*" (it has 1,096,187
> occurrences in the full-text index, as shown in [1]; it's a very
> common word in the names of trademark owners):
>
>   for $results in //case-file[case-file-header/mark-identification/text()
>     contains text {'corporation'}]
>   return $results//mark-identification
>
>
> It takes a long time to get 6,715 records: 62,000ms [3]
>
> If I search for "*live*" (a common word in trademark names, but not in
> owner names), I get 5,875 records in 2,773 ms, which is out of all
> proportion to the 62,000ms needed to get the 6k records for
> "*corporation*". [4]
>
> So...
>
>    - Is this an expected behaviour?
>    - Is there a way to specify which "section" of the full-text index
>    should be used to perform the search? ( I don't know... maybe something
>    similar to "*using stemming*" but "*using index 'mark-identification'*"
>    )
>
> Apologies if I'm asking something that doesn't make sense,
>
> Best regards,
> Sebastian
>
> [1] https://imgur.com/uLla1Xt
> [2] https://imgur.com/Fkcvv2O
> [3] https://imgur.com/Hk71CNe
> [4] https://imgur.com/P72k574
>
>
