Re: [MarkLogic Dev General] What's a best approach to search unstructured docs for numbers Greater Than N?

Gary Russo Mon, 10 Nov 2014 07:31:52 -0800

Thanks Gary, David, and Justin for the suggestions.


I wound up using this path range index: /tax:origin/tax:feed/tax:price//*

 

The query response time is now 100X faster than the brute force XQuery approach.

 

I tested this using a database with 100,000 “random schema” documents.

 

Response time without path-range-index is 20 seconds.

Response time with path-range-index is 0.2 seconds.
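
With a path range index like this in place, the original "greater than N" question can also be asked directly with a range query. The following is only a minimal sketch, not the code used for the timings above. It assumes the index has scalar type decimal, that the tax prefix is bound to the namespace of the sample documents, and that 1000 is the threshold; the variable names are made up for illustration.

```xquery
xquery version "1.0-ml";

(: Prefix used in the indexed path; assumed to match the documents' namespace. :)
declare namespace tax = "http://tax.thomsonreuters.com";

(: Assumption: a path range index of scalar type decimal exists on this path. :)
declare variable $path as xs:string := "/tax:origin/tax:feed/tax:price//*";

(: Hypothetical threshold, matching the "greater than 1,000" requirement. :)
declare variable $n as xs:decimal := xs:decimal(1000);

(: URIs of documents that have at least one indexed value above $n.
   "unfiltered" resolves the search from the indexes alone. :)
for $doc in cts:search(fn:doc(), cts:path-range-query($path, ">", $n), "unfiltered")
return xdmp:node-uri($doc)
```

This returns matching documents rather than the individual values, so it complements the cts:values() lookup instead of replacing it.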

 

The following sample doc shows the /tax:price//* node that contains the “random 
schema” structure.

 

Sample Doc:

 

<origin xmlns="http://tax.thomsonreuters.com">
  <meta>
    <id>211</id>
    <importFileId>1</importFileId>
    <importedUnitCode>RU00211</importedUnitCode>
    <importedAccountCode>AC00790</importedAccountCode>
    <beginningBalance>0</beginningBalance>
    <endingBalance>8743.65477764218</endingBalance>
    <type>ledger</type>
  </meta>
  <feed>
    <price>
      <price78>462774.37</price78>
      <price0>554495.25</price0>
      <price1>911655.28</price1>
      <price2>687320.83</price2>
      <price30>680451.3</price30>
      <balances>
        <beginningBalance>0</beginningBalance>
        <endingBalance>8743.65477764218</endingBalance>
        <trialBalance>
          <trialBalance1>500872.92</trialBalance1>
          <trialBalance2>455478.49</trialBalance2>
        </trialBalance>
      </balances>
    </price>
    <body>pending</body>
  </feed>
</origin>



 

The following code snippet shows the function used. It creates a response doc 
that contains the found values with respective uri and xpath.

 

Path-Range-Index Code Snippet:

 

(: $pathIdx is not defined in this snippet; it is assumed to be declared
   elsewhere in the module and to hold the indexed path shown above. :)
declare function tr:getValuesWithinRange($min as xs:decimal, $max as xs:decimal) as node()*
{
  (: Distinct values read from the path range index, filtered to [$min, $max]. :)
  let $values :=
    for $val in cts:values(cts:path-reference($pathIdx))
    where $val ge $min and $val le $max
    return $val

  (: For each value, find the price nodes containing it and report uri and xpath. :)
  let $valuesDoc :=
    element { "valueItems" }
    {
      for $value in $values
      let $results := cts:search(/tax:origin/tax:feed/tax:price, cts:word-query(xs:string($value)))
      return
        element { "valueItem" }
        {
          element { "value" } { $value },
          element { "count" } { fn:count($results) },
          element { "items" }
          {
            for $item in $results//*/text()
            let $path := xdmp:path($item/..)
            where $item eq $value
            return
              element { "item" }
              {
                element { "uri" }  { xdmp:node-uri($item) },
                element { "path" } { $path }
              }
          }
        }
    }

  let $response :=
    element { "response" }
    {
      element { "input" }
      {
        element { "min" } { $min },
        element { "max" } { $max }
      },
      element { "elapsedTime" }      { xdmp:elapsed-time() },
      element { "uniqueValueCount" } { fn:count($values) },
      (: Cast each value separately; xs:string() takes a single item. :)
      element { "values" }           { fn:string-join(for $v in $values return xs:string($v), " ") },
      $valuesDoc
    }

  return $response
};



 

 

 

 

 

From: [email protected] 
[mailto:[email protected]] On Behalf Of David Ennis
Sent: Wednesday, October 29, 2014 4:56 AM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] What's a best approach to search 
unstructured docs for numbers Greater Than N?

 

Hi.

 

Related to Justin's approach:

 

1) I would start by using a field with a path defined (and the appropriate 
indexes).

This would let you tune where you look for the values over time by adding more 
XPath expressions and using the exclude options. But see the caveat below, 
which is likely why Justin showed XPath expressions that match only elements 
whose values can be cast properly.

 

2) However, I believe that anything related to indexes still requires that the 
element you index does not have mixed content. So the generic approach above, 
without tuning it to target elements that you know can be cast to the correct 
datatype (integer, decimal, etc.), will result in errors like this:

Invalid cast: xs:untypedAtomic("250 9996.26426957374 Public Law 2012-02-14 Redesignates the Noxu...") cast as xs:integer

So, if you really need to look in each and every element of mixed content and 
want to get this via an index, I have a few suggestions:

 

1) Transform the documents and add the decimal value as an attribute. This is 
a relatively trivial recursive function that tests whether an element's value 
can be cast as a number and, if so, adds the value as an attribute. The test 
(taken from FunctX) is simply string(number($value)) != 'NaN'.

 

Your transformed document might look like this:

<doc>
  <prices>
    <beginningBalance decimal-value='250'>250</beginningBalance>
    <endingBalance decimal-value='9996.26426957374'>9996.26426957374</endingBalance>
  </prices>
  <summary>
    <summary-as>Public Law</summary-as>
    <summary-date>2012-02-14</summary-date>
    <summary-text>Redesignates the Noxubee National Wildlife Refuge.....</summary-text>
  </summary>
  <bankCharge decimal-value='1500.75'>1500.75</bankCharge>
</doc>

Then you only need to index the attribute decimal-value. And this attribute 
would only exist for values that are truly interesting.
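
The recursive transform described in suggestion 1 could look roughly like the sketch below. It uses only standard XQuery functions; the function names local:is-a-number and local:annotate are made up for illustration, and limiting the attribute to leaf elements (elements with no child elements) is an assumption about what counts as a value.

```xquery
xquery version "1.0-ml";

(: The FunctX-style test quoted above: true when the string casts to a number. :)
declare function local:is-a-number($value as xs:string) as xs:boolean
{
  string(number($value)) != 'NaN'
};

(: Copy a tree, adding decimal-value to every leaf element holding a number. :)
declare function local:annotate($node as node()) as node()
{
  typeswitch ($node)
    case document-node() return
      document { for $child in $node/node() return local:annotate($child) }
    case element() return
      let $text := normalize-space(string($node))
      return
        element { node-name($node) }
        {
          (: Drop any existing decimal-value so the transform can be re-run. :)
          $node/@*[local-name(.) ne 'decimal-value'],
          if (empty($node/*) and local:is-a-number($text))
          then attribute decimal-value { $text }
          else (),
          for $child in $node/node() return local:annotate($child)
        }
    default return $node
};

local:annotate(
  <doc>
    <prices><beginningBalance>250</beginningBalance></prices>
    <summary><summary-date>2012-02-14</summary-date></summary>
  </doc>
)
```

Here beginningBalance gains the attribute while summary-date does not, because "2012-02-14" does not cast to a number. Note that this test also accepts strings such as "1e3", so a stricter check may be needed if those should be excluded.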

 

OR 2) If you don't want to alter your documents, store these values in the 
properties fragment and then index the decimal-value element(s).

Example properties content:

<interesting-values>
  <decimal-value>250</decimal-value>
  <decimal-value>9996.26426957374</decimal-value>
  <decimal-value>1500.75</decimal-value>
</interesting-values>

 

For this option, it is also possible to store the original path if you really 
need to keep the context of the numbers. I would store this and any other 
interesting information as attributes of the decimal-value elements:

<decimal-value xpath='/doc/prices/endingBalance' original-element-name='endingBalance'>9996.26426957374</decimal-value>

 

This last format truly gives you flexibility when mixed with the power of 
fields, as you can start to target and prioritize specific elements if needed.
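
A rough sketch of how the properties-fragment option might be populated is shown below. It is not a tested implementation: the URI is hypothetical, the function name local:interesting-values is made up, and it assumes that only leaf elements are of interest and that the properties are written with xdmp:document-set-property().

```xquery
xquery version "1.0-ml";

(: Build the <interesting-values> element for one document, keeping the
   original path and element name of every numeric leaf element. :)
declare function local:interesting-values($doc as node()) as element(interesting-values)
{
  <interesting-values>{
    for $el in $doc//*[fn:empty(*)]
    let $text := fn:normalize-space(fn:string($el))
    where fn:string(fn:number($text)) ne "NaN"
    return
      <decimal-value xpath="{ xdmp:path($el) }"
                     original-element-name="{ fn:local-name($el) }">{ $text }</decimal-value>
  }</interesting-values>
};

(: Hypothetical URI; the document is assumed to exist. :)
let $uri := "/example/doc1.xml"
return xdmp:document-set-property($uri, local:interesting-values(fn:doc($uri)))
```

Because the values live in the properties fragment, the source document stays untouched, and a range index on decimal-value covers every number regardless of its original element name.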

 

 

 

Kind Regards,

David Ennis

 

 




 

 


 

 

David Ennis
Content Engineer

 <http://www.hinttech.com/> HintTech 

 

On 29 October 2014 00:15, Gary Russo <[email protected]> wrote:

I have a database that consists of highly unstructured documents. The documents 
contain pricing data.

My requirement is to search all nodes of every document to find a number that 
is greater than 1,000.

The element names containing the numbers can be anything.

Example Docs:

Doc 1

<doc>
  <prices>
    <beginningBalance>250</beginningBalance>
    <endingBalance>9996.26426957374</endingBalance>
  </prices>
  <summary>
    <summary-as>Public Law</summary-as>
    <summary-date>2012-02-14</summary-date>
    <summary-text>Redesignates the Noxubee National Wildlife 
Refuge.....</summary-text>
  </summary>
  <bankCharge>1500.75</bankCharge>
</doc>

Doc 2

<doc>
  <cost>
    <startBalance>250</startBalance>
    <endBalance>9996.26426957374</endBalance>
  </cost>
  <summary>
    <summary-as>Public Law</summary-as>
    <summary-date>2012-02-14</summary-date>
  </summary>
  <bankFee>1500.75</bankFee>
</doc>

I can use a brute force XQuery code snippet like the following but I’d like to 
use the universal index.

What is the recommended approach for something like this?

 

Brute Force XQuery:

let $values :=
  for $n in $doc1//node()/*/text()
  let $value := try { xs:float($n) } catch ($e) { () }
  where $value gt 1000
  return
    $value||" | "||xdmp:path($n)
return $values

 

Gary Russo

Enterprise NoSQL Developer

http://garyrusso.wordpress.com

 


_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
