Re: [MarkLogic Dev General] What's a best approach to search unstructured docs for numbers Greater Than N?

David Ennis Wed, 29 Oct 2014 01:56:56 -0700

Hi,

Related to Justin's approach:

1) I would start by using a field with paths defined (and the appropriate
indexes).
This lets you tune where you look for the values over time by adding more
XPath expressions and using the exclude options. But see below: this is
likely why Justin showed XPath expressions that match only elements whose
values can be cast properly.

2) However, I believe that anything related to indexes still requires that
the element you index does not have mixed content. So the generic approach
above, without tuning it to target elements that you know can be cast to the
correct datatype (integer, decimal, etc.), will result in errors like this:

Invalid cast: xs:untypedAtomic("250 9996.26426957374 Public Law 2012-02-14
Redesignates the Noxu...") cast as xs:integer

So, if you really need to look in *each and every element of mixed content*
and want to get this via an index, I have a few suggestions:

1) Transform the documents and add the decimal value as an attribute. This
takes a relatively trivial recursive function that tests whether an element's
value can be cast as a number and, if so, adds the value as an attribute. The
test (taken from FunctX) is simply *string(number($value)) != 'NaN'*.

Your transformed document might look like this:
<doc>
  <prices>
    <beginningBalance decimal-value='250'>250</beginningBalance>
    <endingBalance decimal-value='9996.26426957374'>9996.26426957374</endingBalance>
  </prices>
  <summary>
    <summary-as>Public Law</summary-as>
    <summary-date>2012-02-14</summary-date>
    <summary-text>Redesignates the Noxubee National Wildlife Refuge.....</summary-text>
  </summary>
  <bankCharge decimal-value='1500.75'>1500.75</bankCharge>
</doc>

Then you only need to index the attribute decimal-value. And this attribute
would only exist for values that are truly interesting.
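
The recursive transform described for option 1 could be sketched roughly as
follows. This is only a minimal sketch in plain XQuery, not tested against
your data: the function name local:add-decimal-value is made up for
illustration, and I am assuming you only want to mark leaf elements (elements
without child elements).

```xquery
xquery version "1.0";

(: Hypothetical helper (illustrative name, not an existing API):
   copies a node tree and adds a decimal-value attribute to every
   leaf element whose text passes the FunctX-style number test. :)
declare function local:add-decimal-value($node as node()) as node()
{
  typeswitch ($node)
    case document-node() return
      document {
        for $child in $node/node()
        return local:add-decimal-value($child)
      }
    case element() return
      element { node-name($node) } {
        (: keep existing attributes, but never emit decimal-value twice :)
        $node/@* except $node/@decimal-value,
        (: assumption: only leaf elements are candidates :)
        if (empty($node/*) and string(number($node)) != 'NaN')
        then attribute decimal-value { normalize-space($node) }
        else (),
        for $child in $node/node()
        return local:add-decimal-value($child)
      }
    (: text, comments and processing instructions are copied unchanged :)
    default return $node
};

local:add-decimal-value(
  <doc>
    <prices>
      <beginningBalance>250</beginningBalance>
    </prices>
    <summary-date>2012-02-14</summary-date>
  </doc>
)
```

Note that the number test also accepts values such as 'INF' or '1e3'. If the
index is typed as decimal, a stricter check such as
"$node castable as xs:decimal" may be the safer choice.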

OR 2) If you don't want to alter your documents, store these values in the
properties fragment and then index the decimal-value element(s).
Example properties content:
<interesting-values>
  <decimal-value>250</decimal-value>
  <decimal-value>9996.26426957374</decimal-value>
  <decimal-value>1500.75</decimal-value>
</interesting-values>

For this option, it is also possible to store the original path if you
really need to keep the context of the numbers. I would store this and any
other interesting information as attributes of the decimal-value elements:
<decimal-value xpath='/doc/prices/endingBalance' original-element-name='endingBalance'>9996.26426957374</decimal-value>
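
Populating the properties fragment for one document might look something
like the sketch below. The URI is a made-up example, the leaf-element
restriction is my assumption, and you would normally run this per document
(for example in a batch job or on ingest) rather than by hand.

```xquery
xquery version "1.0-ml";

(: Hypothetical document URI, for illustration only :)
declare variable $uri as xs:string := "/docs/doc1.xml";

(: Replaces any existing interesting-values property on the document :)
xdmp:document-set-property(
  $uri,
  <interesting-values>{
    (: assumption: only leaf elements that pass the number test are kept :)
    for $e in fn:doc($uri)//*[fn:empty(*)][fn:string(fn:number(.)) != 'NaN']
    return
      <decimal-value xpath="{xdmp:path($e)}"
                     original-element-name="{fn:local-name($e)}">{
        fn:normalize-space($e)
      }</decimal-value>
  }</interesting-values>
)
```

Because xdmp:document-set-property replaces a property with the same name,
re-running this after a document changes refreshes the stored values.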

This last format truly gives you flexibility when combined with the power of
fields, as you can start to target and prioritize specific elements if
needed.
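
Once the values are in the properties fragment, a greater-than search could
be sketched like this. It assumes an element range index of type decimal has
been configured on decimal-value (no namespace); treat it as a starting
point, not a tuned query.

```xquery
xquery version "1.0-ml";

(: Assumes an element range index of scalar type decimal on decimal-value;
   without that index the range query raises an error.
   I believe later MarkLogic releases rename cts:properties-query to
   cts:properties-fragment-query, so check the docs for your version. :)
cts:search(
  fn:doc(),
  cts:properties-query(
    cts:element-range-query(xs:QName("decimal-value"), ">", xs:decimal(1000))
  )
)
```

This returns the documents whose properties contain at least one
decimal-value greater than 1000. The xpath attribute stored on each
decimal-value then tells you where in the document the number came from.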

Kind Regards,
David Ennis
*Content Engineer*

HintTech <http://www.hinttech.com/>
Mastering the value of content
creative | technology | content

Delftechpark 37i
2628 XJ Delft
The Netherlands
T: +31 88 268 25 00
M: +31 63 091 72 80


On 29 October 2014 00:15, Gary Russo <[email protected]> wrote:

> I have a database that consists of highly unstructured documents. The
> documents contain pricing data.
>
>
>
> My requirement is to search all nodes of every document to find a number
> that is greater than 1,000.
>
>
>
> The element names containing the numbers can be anything.
>
>
>
> Example Docs:
>
>
>
> *Doc 1*
>
>
>
> <doc>
>   <prices>
>     <beginningBalance>250</beginningBalance>
>     <endingBalance>9996.26426957374</endingBalance>
>   </prices>
>   <summary>
>     <summary-as>Public Law</summary-as>
>     <summary-date>2012-02-14</summary-date>
>     <summary-text>Redesignates the Noxubee National Wildlife Refuge.....
> </summary-text>
>   </summary>
>   <bankCharge>1500.75</bankCharge>
> </doc>
>
>
>
> *Doc 2*
>
>
>
> <doc>
>   <cost>
>     <startBalance>250</startBalance>
>     <endBalance>9996.26426957374</endBalance>
>   </cost>
>   <summary>
>     <summary-as>Public Law</summary-as>
>     <summary-date>2012-02-14</summary-date>
>   </summary>
>   <bankFee>1500.75</bankFee>
> </doc>
>
>
>
> I can use a brute force XQuery code snippet like the following but I’d
> like to use the universal index.
>
>
>
> What is the recommended approach for something like this?
>
>
>
>
>
> *Brute Force XQuery:*
>
>
>
> let $values :=
>   for $n in $doc1//node()/*/text()
>   let $value := try { xs:float($n) } catch ($e) { () }
>   where $value gt 1000
>   return $value||" | "||xdmp:path($n)
> return $values
>
>
>
>
>
>
>
>
>
> *Gary Russo*
>
> *Enterprise NoSQL Developer*
>
> *http://garyrusso.wordpress.com <http://garyrusso.wordpress.com>*
>
>
>
> _______________________________________________
> General mailing list
> [email protected]
> http://developer.marklogic.com/mailman/listinfo/general
>
>

