Re: [MarkLogic Dev General] RE: Help with co-occurrence (?) xquery

Jason Hunter Sun, 13 Dec 2009 15:00:59 -0800

Here's an idea.

Make <item> into its own fragment.  Put a range index on <a>.  Run two
cts:element-values() calls, passing each a cts:element-value-query() that
limits <b> to value 1 (in the first call) and 2 (in the second).  Intersect
the two returned lists to find the set of <a> values that have <b> values of
both 1 and 2.  (The lists hold atomic values, so do this with a value filter
such as $list1[. = $list2]; the XQuery "intersect" operator only accepts
nodes.)  I think this will work if you know in advance the values 1 and 2
that you're trying to match against.
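
A rough, untested sketch of the two-lookup idea follows.  It assumes each
<item> is its own fragment, that <a> has an element range index, and that
<a> and <b> are in no namespace; the variable names are just placeholders.

```xquery
xquery version "1.0-ml";

(: Sketch only. Assumes <item> is a fragment root and <a> has an
   element range index; $value1 and $value2 are the known <b> values. :)
let $value1 := "1"
let $value2 := "2"

(: <a> values taken from fragments whose <b> equals the first value :)
let $a-with-b1 := cts:element-values(xs:QName("a"), (), (),
                    cts:element-value-query(xs:QName("b"), $value1))

(: <a> values taken from fragments whose <b> equals the second value :)
let $a-with-b2 := cts:element-values(xs:QName("a"), (), (),
                    cts:element-value-query(xs:QName("b"), $value2))

(: Both lists are atomic values, so keep the values present in both
   with a general comparison instead of the node-only "intersect". :)
return $a-with-b1[. = $a-with-b2]
```

With the sample data in the quoted message below, and under those
assumptions, this would be expected to return the <a> values 1 and 4.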


If that does the job for you it has the performance advantage of not needing to 
read any fragments off disk.

Depending on what you know about your data and if it's fully normalized, you 
might be able to do it with one cts:element-values() call, passing a 
cts:or-query() allowing either <b> value, and look at cts:frequency on each <a> 
result to find hits that have a frequency of 2.
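
Again as an untested sketch, the single-call variant might look like the
following.  It assumes the same fragmentation and range index as above, and
additionally that no (a, b) pair appears more than once; otherwise a
frequency of 2 would not reliably mean "both values present".

```xquery
xquery version "1.0-ml";

(: Sketch only. Assumes one <item> per fragment, an element range index
   on <a>, and no duplicated (a, b) pairs in the data. :)
let $value1 := "1"
let $value2 := "2"

(: Match fragments whose <b> is either of the two values :)
let $either := cts:or-query((
  cts:element-value-query(xs:QName("b"), $value1),
  cts:element-value-query(xs:QName("b"), $value2)))

(: With fragment-frequency, cts:frequency counts the matching fragments
   that contain each <a> value; 2 means both <b> values were seen. :)
for $a in cts:element-values(xs:QName("a"), (), "fragment-frequency", $either)
where cts:frequency($a) eq 2
return $a
```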

-jh-

On Dec 13, 2009, at 6:03 AM, Lee, David wrote:

> Thanks for the suggestions. I've solved the problem by re-organizing my
> XML.
> But I think the original problem is still "academically" interesting so
> let me restate it
> in a simpler form.
> 
> Suppose I have a document with a list of elements like
> 
> <item><a>1</a><b>1</b></item>
> <item><a>1</a><b>2</b></item>
> <item><a>2</a><b>3</b></item>
> <item><a>2</a><b>4</b></item>
> <item><a>3</a><b>5</b></item>
> <item><a>4</a><b>1</b></item>
> <item><a>4</a><b>7</b></item>
> <item><a>4</a><b>2</b></item>
> 
> To make sense of these, think of them as relation items.
> They are saying, e.g.,
>   a=1  is related to both b=1 and b=2
> Another way to look at this data is as "property" data:
> item a=2 has 2 properties (3,4)
> 
> My query is a co-occurrence-like query.
> I want to find all <a> values for which there exist 2 relations (or
> properties) with specific values.
> 
> In this case both <a> and <b> refer to elements in a different document
> (linked via the <a> value)
> 
> 
> So I can't use the co-occurrence search in ML (it requires the elements to
> be in the same fragment).
> That leaves me with pure XQuery, where I'm forced to loop over all
> values of <a>.
> 
> Something like this.  (Haven't tried it, but it's close, I think.)
> 
> for $a in fn:distinct-values( //item/a/string() )
> let $as := //item[a eq $a]
> where exists($as[b eq $value1]) and exists($as[b eq $value2])
> return $a
> 
> 
> The problem is that this is extremely slow when I have 500,000 item
> entries.
> None of the suggestions solved the basic problem: I can't find
> another approach
> to do a query which doesn't require a loop over all values of <a>.
> There is no (that I can find)
>   cts:search( //item ,
>       magic search that returns matches where a is the same and b
> matches both $value1 and $value2 )
> 
> 
> 
> The final answer?
> What I should have done in the first place: de-normalize the data.
> As a second pass at data loading, I've put all of these relations
> *within* the master element,
> as opposed to a separate relations document. This looks something like
> 
> <element>
>   <id>1</id>
>   <value1> ... </value1>
>   ...
>   <items>
>       <item><a>1</a><b>1</b></item>
>       <item><a>1</a><b>2</b></item>
>   </items>
> </element>
> 
> 
> It made the data size a bit bigger, because in reality these are
> relations, not properties,
> so I had to include in elements id=x  all items where either a=x or b=x
> so potentially it doubled the number of <item> elements overall.
> But the end result is much easier to query.
> 
> 
> 
> Now the query is something ML can optimize (and does very well)
> 
>    //element[items/item/b = $value1 and items/item/b = $value2]
> 
> Or this can be reformulated into a cts:search easily (with an and
> query), which is what I did.
> 
> 
> 
> 
> So the conclusion?
> What I'm finding (and what was suggested earlier in another question)
> is that when given normalized relational-type data, it's often best to
> de-normalize the data
> before loading into ML.
> The problem I have is that the raw data is pretty big and doesn't fit
> into memory,
> so I need a database to load the raw data in order to de-normalize it!
> And even when it does fit into memory, memory-based XQuery code handles
> it pretty badly because
> there is no indexing, so it takes forever to run the denormalization.
> 
> So I'm left with these 2 options ... other suggestions welcome.
> 
> 1) Load the relational data into an RDBMS
> E.g., load the "flat" data into something like MySQL.  Then use a
> programming language and SQL code to produce de-normalized XML.  (I'm
> thinking of extending xmlsh's  xsql to handle master-detail queries to
> do this).
> 
> 2) Load the Flat data into a separate ML database (or directory), but
> probably a totally separate DB which has its settings tuned for fastest
> load (i.e. all the wildcard and stemmed searching turned off), then run
> xquery on this DB to produce the de-normalized XML back to the
> filesystem then load that data into the target DB.  I have found I have
> to do this iteratively because the resultant XML document is too big to
> fit in memory so XCC crashes out if I produce one big denormalized file.
> 
> 
> 
> Any other suggestions on how people de-normalize flat data to load into
> ML ?
> 
> Thanks for any suggestions.
> 
> 
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Danny
> Sokolsky
> Sent: Saturday, December 12, 2009 6:55 PM
> To: General Mark Logic Developer Discussion
> Subject: [MarkLogic Dev General] RE: Help with co-occurance (?) xquery
> 
> Hi David,
> 
> I am not sure I understand what you are trying to do, but here are a few
> ideas:
> 
> * consider creating a range index on the CONCEPTID element.  Then you
> can use cts:element-values to get all of the unique CONCEPTID values
> very quickly (and you can use the cts:query parameter to constrain it to
> a cts:query--like a cts:directory-query)
> 
> * the predicate [CONCEPTID2 = ($d1,$d2)] will return true if *either*
> value is there; I think (not sure) that you wanted both to be there.
> 
> * if you know the full path to your elements, that is preferable to
> using //
> 
> -Danny
> ________________________________________
> From: [email protected]
> [[email protected]] On Behalf Of Lee, David
> [[email protected]]
> Sent: Friday, December 11, 2009 6:30 PM
> To: [email protected]
> Subject: [MarkLogic Dev General] Help with co-occurance (?) xquery
> 
> If anyone has any suggestions for this I'd love to hear them.
> 
> I have a bunch of records (500k+) of elements like this:
> 
> <RELATIONSHIP>
>    <RELATIONSHIPID>1149315021</RELATIONSHIPID>
>    <CONCEPTID1>7826003</CONCEPTID1>
>    <RELATIONSHIPTYPE>246456000</RELATIONSHIPTYPE>
>    <CONCEPTID2>288526004</CONCEPTID2>
>    <CHARACTERISTICTYPE>1</CHARACTERISTICTYPE>
>    <REFINABILITY>2</REFINABILITY>
>    <RELATIONSHIPGROUP>0</RELATIONSHIPGROUP>
>  </RELATIONSHIP>
> 
> 
> Given 2 concept IDs ... I want to query for RELATIONSHIP records where
> CONCEPTID1 is the same, and CONCEPTID2 matches the 2 IDs I have.
> 
> I also have a 'master' record set of <CONCEPT> which lists all the
> concept IDs, if that helps.
> I'm trying this naive xquery ... which hasn't completed yet:
> 
> let $d1 := 387494007,
>    $d2 := 387458008
> 
> for $c in xdmp:directory("/SNOMED/concepts/")//CONCEPT
> let $cid := $c/CONCEPTID/string()
> return
>   xdmp:directory("/SNOMED/relationships/")//RELATIONSHIP[CONCEPTID1 eq
> $cid][CONCEPTID2 = ($d1,$d2)]
> 
> 
> I don't think it's quite what I'm looking for, but it's close.
> The problem is that 10 minutes later it hasn't returned yet.
> I'm sure it's not using any kind of indexing, which is going to be way too
> slow.
> I'm looking at cts:element-value-co-occurrences,
> 
> which seems to be very close to what I want, but these 2 relationship
> elements don't co-occur within the same fragment.
> My next thought is to regenerate the data, putting all relationships
> for a given CONCEPTID1 inside the CONCEPT element which is associated with it.
> That's probably a better design anyway ...
> but any suggestions on how to query this in a different way are very
> welcome.
> 
> 
> 
> 
> 
> 
> 
> ----------------------------------------
> David A. Lee
> Senior Principal Software Engineer
> Epocrates, Inc.
> [email protected]<mailto:[email protected]>
> 812-482-5224
> 
> 
> _______________________________________________
> General mailing list
> [email protected]
> http://xqzone.com/mailman/listinfo/general