Thanks Gary, David, and Justin for the suggestions.
I wound up using this path range index:

    /tax:origin/tax:feed/tax:price//*

The query response time is now 100x faster than the brute-force XQuery approach. I tested this using a database with 100,000 “random schema” documents.

    Response time without the path range index: 20 seconds
    Response time with the path range index:    0.2 seconds

The following sample doc shows the /tax:price//* node that contains the “random schema” structure.

Sample Doc:

    <origin xmlns="http://tax.thomsonreuters.com">
      <meta>
        <id>211</id>
        <importFileId>1</importFileId>
        <importedUnitCode>RU00211</importedUnitCode>
        <importedAccountCode>AC00790</importedAccountCode>
        <beginningBalance>0</beginningBalance>
        <endingBalance>8743.65477764218</endingBalance>
        <type>ledger</type>
      </meta>
      <feed>
        <price>
          <price78>462774.37</price78>
          <price0>554495.25</price0>
          <price1>911655.28</price1>
          <price2>687320.83</price2>
          <price30>680451.3</price30>
          <balances>
            <beginningBalance>0</beginningBalance>
            <endingBalance>8743.65477764218</endingBalance>
            <trialBalance>
              <trialBalance1>500872.92</trialBalance1>
              <trialBalance2>455478.49</trialBalance2>
            </trialBalance>
          </balances>
        </price>
        <body>pending</body>
      </feed>
    </origin>

The following code snippet shows the function used. It creates a response doc that contains the found values, each with its URI and XPath.

Path-Range-Index Code Snippet:

    declare function tr:getValuesWithinRange($min as xs:decimal, $max as xs:decimal) as node()*
    {
      let $values :=
        for $val in cts:values(cts:path-reference($pathIdx))
        where $val ge $min and $val le $max
        return $val

      let $valuesDoc :=
        element { "valueItems" } {
          for $value in $values
          let $results := cts:search(/tax:origin/tax:feed/tax:price, cts:word-query(xs:string($value)))
          return
            element { "valueItem" } {
              (
                element { "value" } { $value },
                element { "count" } { fn:count($results) },
                element { "items" } {
                  for $item in $results//*/text()
                  let $path := xdmp:path($item/..)
                  where $item eq $value
                  return
                    element { "item" } {
                      element { "uri" } { xdmp:node-uri($item) },
                      element { "path" } { $path }
                    }
                }
              )
            }
        }

      let $response :=
        element { "response" } {
          element { "input" } {
            element { "min" } { $min },
            element { "max" } { $max }
          },
          element { "elapsedTime" } { xdmp:elapsed-time() },
          element { "uniqueValueCount" } { fn:count($values) },
          element { "values" } { fn:string-join(xs:string($values), " ") },
          $valuesDoc
        }

      return $response
    };

From: [email protected] [mailto:[email protected]] On Behalf Of David Ennis
Sent: Wednesday, October 29, 2014 4:56 AM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] What's a best approach to search unstructured docs for numbers Greater Than N?

Hi.

Related to Justin's approach:

1) I would start by using a field with a path defined (and appropriate indexes). This would allow you to more easily tune where you look for the values over time by adding more XPath expressions and using the exclude options. But see below, which is likely the reason Justin showed XPath expressions that match only elements whose values can be cast properly.

2) However, I believe that anything related to indexes still requires that the element you index does not have mixed content. So the generic approach above, without tuning it to target elements that you know can be cast to the correct datatype (integer, decimal, etc.), will result in errors like this:

    Invalid cast: xs:untypedAtomic("250 9996.26426957374 Public Law 2012-02-14 Redesignates the Noxu...") cast as xs:integer

So, if you really need to look in each and every element of mixed content and want to get this via an index, I have a few suggestions:

1) Transform the documents and add the decimal value as an attribute. This is a relatively trivial recursive function that tests whether an element's value can be cast as a number and then adds the value as an attribute. The test (taken from FunctX) is simply string(number($value)) != 'NaN'.
Your transformed document might look like this:

    <doc>
      <prices>
        <beginningBalance decimal-value='250'>250</beginningBalance>
        <endingBalance decimal-value='9996.26426957374'>9996.26426957374</endingBalance>
      </prices>
      <summary>
        <summary-as>Public Law</summary-as>
        <summary-date>2012-02-14</summary-date>
        <summary-text>Redesignates the Noxubee National Wildlife Refuge.....</summary-text>
      </summary>
      <bankCharge decimal-value='1500.75'>1500.75</bankCharge>
    </doc>

Then you only need to index the attribute decimal-value, and this attribute would only exist for values that are truly interesting.

OR

2) If you don't want to alter your document, store these values in the properties fragment and then index the decimal-value element(s). Example properties content:

    <interesting-values>
      <decimal-value>250</decimal-value>
      <decimal-value>9996.26426957374</decimal-value>
      <decimal-value>1500.75</decimal-value>
    </interesting-values>

For this option, it is also possible to store the original path if you really need to keep the context of the numbers. I would store this and any other interesting information as attributes of the decimal-value elements:

    <decimal-value xpath='/doc/prices/endingBalance' original-element-name='endingBalance'>9996.26426957374</decimal-value>

This last format truly gives you flexibility when mixed with the power of fields, as you can start to target and prioritize specific elements if needed.

Kind Regards,
David Ennis

David Ennis
Content Engineer
HintTech
Mastering the value of content
creative | technology | content
Delftechpark 37i
2628 XJ Delft
The Netherlands
T: +31 88 268 25 00
M: +31 63 091 72 80
http://www.hinttech.com

On 29 October 2014 00:15, Gary Russo <[email protected]> wrote:

I have a database that consists of highly unstructured documents. The documents contain pricing data.
My requirement is to search all nodes of every document to find a number that is greater than 1,000. The element names containing the numbers can be anything.

Example Docs:

Doc 1

    <doc>
      <prices>
        <beginningBalance>250</beginningBalance>
        <endingBalance>9996.26426957374</endingBalance>
      </prices>
      <summary>
        <summary-as>Public Law</summary-as>
        <summary-date>2012-02-14</summary-date>
        <summary-text>Redesignates the Noxubee National Wildlife Refuge.....</summary-text>
      </summary>
      <bankCharge>1500.75</bankCharge>
    </doc>

Doc 2

    <doc>
      <cost>
        <startBalance>250</startBalance>
        <endBalance>9996.26426957374</endBalance>
      </cost>
      <summary>
        <summary-as>Public Law</summary-as>
        <summary-date>2012-02-14</summary-date>
      </summary>
      <bankFee>1500.75</bankFee>
    </doc>

I can use a brute force XQuery code snippet like the following but I’d like to use the universal index. What is the recommended approach for something like this?

Brute Force XQuery:

    let $values :=
      for $n in $doc1//node()/*/text()
      let $value := try { xs:float($n) } catch ($e) { () }
      where $value gt 1000
      return $value||" | "||xdmp:path($n)
    return $values

Gary Russo
Enterprise NoSQL Developer
http://garyrusso.wordpress.com

_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
