Re: [MarkLogic Dev General] RE: Help with co-occurrence (?) xquery

Andrew Welch Mon, 14 Dec 2009 01:16:11 -0800

2009/12/13 Jason Hunter <[email protected]>:
> Here's an idea.
>
> Make <item> into its own fragment.  Put a range index on <a>.  Run two 
> cts:element-values() calls and to each pass a cts:element-value-query() 
> limiting <b> to values 1 (in the first) and 2 (for the second).  Use 
> "intersect" on the two returned lists to find the set of <a> values that have 
> <b> values of both 1 and 2.  I think this will work if you know in advance 
> the values 1 and 2 that you're trying to match against.
>


Hi Jason,

If the input is like:

<item><a>1</a><b>1</b></item>
<item><a>1</a><b>2</b></item>

then intersect won't help as it would need to be the same <a> element,
rather than two elements with the same name and value, if I've
understood your suggestion correctly?
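
To make the distinction concrete, here is a minimal sketch in plain XQuery with the two items inlined (no MarkLogic functions; the variable names are only illustrative). The "intersect" operator compares node identity, so it finds nothing, whereas comparing the <a> elements by value does find the shared value:

```xquery
xquery version "1.0";

(: The two sample items from above, inlined for illustration. :)
let $items := (
  <item><a>1</a><b>1</b></item>,
  <item><a>1</a><b>2</b></item>
)
let $a1 := $items[b = 1]/a   (: <a> nodes that sit next to b = 1 :)
let $a2 := $items[b = 2]/a   (: <a> nodes that sit next to b = 2 :)
return (
  count($a1 intersect $a2),        (: node identity: no shared node, so 0 :)
  distinct-values($a1[. = $a2])    (: value equality: the shared value 1 :)
)
```

If the lists come back from a lexicon call as atomic values rather than nodes, only the value-based comparison would apply.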

cheers
andrew



> If that does the job for you it has the performance advantage of not needing 
> to read any fragments off disk.
>
> Depending on what you know about your data and if it's fully normalized, you 
> might be able to do it with one cts:element-values() call, passing a 
> cts:or-query() allowing either <b> value, and look at cts:frequency on each 
> <a> result to find hits that have a frequency of 2.
>
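
For reference, a rough sketch of both ideas quoted above. It assumes <item> is configured as a fragment root, that <a> has an element range index, and that the <b> values are stored as the text "1" and "2"; none of this has been run against the real data, and the variable names are only illustrative. The lists are combined by value comparison because cts:element-values returns atomic values, not nodes:

```xquery
xquery version "1.0-ml";

(: Assumptions: <item> is a fragment root, <a> has an element range index,
   and the two <b> values are known in advance. :)
declare variable $A := xs:QName("a");
declare variable $B := xs:QName("b");

(: Idea 1: one lexicon lookup per <b> value, then keep the <a> values
   that appear in both lists (compared by value). :)
let $a-with-b1 := cts:element-values($A, (), (), cts:element-value-query($B, "1"))
let $a-with-b2 := cts:element-values($A, (), (), cts:element-value-query($B, "2"))
let $two-lookups := $a-with-b1[. = $a-with-b2]

(: Idea 2: one lookup over fragments matching either <b> value. If no
   (a, b) pair is duplicated, an <a> value found in 2 fragments has both. :)
let $either := cts:or-query((
  cts:element-value-query($B, "1"),
  cts:element-value-query($B, "2")
))
let $one-lookup :=
  for $a in cts:element-values($A, (), (), $either)
  where cts:frequency($a) eq 2
  return $a

return
  <results>
    <two-lookups>{ $two-lookups }</two-lookups>
    <one-lookup>{ $one-lookup }</one-lookup>
  </results>
```

The second variant only gives the right answer when the data is fully normalized, as noted above: a repeated (a, b) pair would also produce a frequency of 2.
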
> -jh-
>
> On Dec 13, 2009, at 6:03 AM, Lee, David wrote:
>
>> Thanks for the suggestions. I've solved the problem by reorganizing my
>> XML.
>> But I think the original problem is still "academically" interesting, so
>> let me restate it in a simpler form.
>>
>> Suppose I have a document with a list of elements like
>>
>> <item><a>1</a><b>1</b></item>
>> <item><a>1</a><b>2</b></item>
>> <item><a>2</a><b>3</b></item>
>> <item><a>2</a><b>4</b></item>
>> <item><a>3</a><b>5</b></item>
>> <item><a>4</a><b>1</b></item>
>> <item><a>4</a><b>7</b></item>
>> <item><a>4</a><b>2</b></item>
>>
>> To make sense of these, think of them as relation items.
>> They are saying, e.g.,
>>   a=1 is related to both b=1 and b=2
>> Another way to look at this data is as "property" data:
>>   item a=2 has 2 properties (3,4)
>>
>> My query is a co-occurrence-like query.
>> I want to find all <a> values for which there exist 2 relations (or
>> properties) with specific values.
>>
>> In this case both <a> and <b> refer to elements in a different document
>> (linked via the <a> value).
>>
>>
>> So I can't use the co-occurrence search in ML (it requires the elements
>> to be in the same fragment).
>> That leaves me with pure XQuery, where I'm forced to loop over all
>> values of <a>.
>>
>> Something like this.  (Haven't tried it, but it's close, I think.)
>>
>> for $a in fn:distinct-values( //item/a/string() )
>> let $as := //item[a eq $a]
>> where  exists($as[b eq $value1]) and exists($as[b eq $value2])
>> return $a
>>
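
As a check on what this loop is meant to compute, here is a small self-contained sketch in plain XQuery. It inlines the eight sample items from this message; the $items variable and the choice of "1" and "2" as the two target values are assumptions made for illustration:

```xquery
xquery version "1.0";

(: The eight sample <item> elements from this message, inlined. :)
let $items := (
  <item><a>1</a><b>1</b></item>,
  <item><a>1</a><b>2</b></item>,
  <item><a>2</a><b>3</b></item>,
  <item><a>2</a><b>4</b></item>,
  <item><a>3</a><b>5</b></item>,
  <item><a>4</a><b>1</b></item>,
  <item><a>4</a><b>7</b></item>,
  <item><a>4</a><b>2</b></item>
)
let $value1 := "1"
let $value2 := "2"
(: One pass per distinct <a> value: this is the loop that gets slow. :)
for $a in distinct-values($items/a)
let $as := $items[a = $a]
where exists($as[b = $value1]) and exists($as[b = $value2])
return $a
```

On this sample the qualifying <a> values are 1 and 4. The inner filter rescans every item for each distinct <a> value, which is why the same shape of query struggles without index support at 500,000 items.
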
>>
>> The problem is that this is extremely slow when I have 500,000 item
>> entries.
>> None of the suggestions solved the basic problem: I can't find
>> another way to write the query that doesn't require a loop over all
>> values of <a>. As far as I can find, there is no
>>   cts:search( //item ,
>>       magic search that returns matches where a is the same and b
>>       matches both $value1 and $value2 )
>>
>>
>>
>> The final answer?
>> What I should have done in the first place: DE-normalize the data.
>> As a second pass at data loading, I've put all of these relations
>> *within* the master element,
>> as opposed to a separate relations document. This looks something like
>>
>> <element>
>>   <id>1</id>
>>   <value1> ... </value1>
>>   ...
>>   <items>
>>       <item><a>1</a><b>1</b></item>
>>       <item><a>1</a><b>2</b></item>
>>   </items>
>> </element>
>>
>>
>> It made the data size a bit bigger, because in reality these are
>> relations, not properties,
>> so in the element with id=x I had to include all items where either
>> a=x or b=x, which potentially doubled the number of <item> elements
>> overall.
>> But the end result is much easier to query.
>>
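
A minimal sketch of that regrouping step in plain XQuery, using three of the sample items inlined. The <element>, <id>, and <items> wrapper names follow the layout shown above; everything else is an assumption for illustration, and this is not the loading code that was actually used:

```xquery
xquery version "1.0";

(: Three flat relation items, inlined for illustration. :)
let $items := (
  <item><a>1</a><b>1</b></item>,
  <item><a>1</a><b>2</b></item>,
  <item><a>2</a><b>3</b></item>
)
(: Every id that appears on either side of a relation gets its own
   master element holding all items that mention it. :)
for $id in distinct-values(($items/a, $items/b))
return
  <element>
    <id>{ $id }</id>
    <items>{ $items[a = $id or b = $id] }</items>
  </element>
```

The item with a=2 and b=3 lands under both id 2 and id 3, which is the duplication described above.
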
>>
>>
>> Now the query is something ML can optimize (and does very well)
>>
>>    //element[items/item/b = $value1 and items/item/b = $value2]
>>
>> Or this can be reformulated into a cts:search easily (with an and
>> query), which is what I did.
>>
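
One way the cts:search form might look, as a sketch only: it assumes the denormalized <element> documents are loaded and that the <b> values can be matched as the strings "1" and "2". It is not the query that was actually run:

```xquery
xquery version "1.0-ml";

(: Assumption: the two target <b> values, given as strings. :)
let $value1 := "1"
let $value2 := "2"
return
  cts:search(//element,
    (: Both <b> values must occur within the same <element>. :)
    cts:and-query((
      cts:element-value-query(xs:QName("b"), $value1),
      cts:element-value-query(xs:QName("b"), $value2)
    )))
```

Because both relations now live in the same fragment, the and-query can be resolved from the indexes without looping over every <a> value.
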
>>
>>
>>
>> So the conclusion?
>> What I'm finding (and what was suggested earlier in another question)
>> is that when given normalized, relational-type data, it's often best to
>> DE-normalize the data before loading it into ML.
>> The problem I have is that the raw data is pretty big and doesn't fit
>> into memory, so I need a database to load the raw data into in order to
>> de-normalize it!!!
>> And even when it does fit into memory, memory-based XQuery code handles
>> it pretty badly because there is no indexing, so it takes forever to
>> run the denormalization.
>>
>> So I'm left with these 2 options ... other suggestions welcome.
>>
>> 1) Load the relational data into an RDBMS.
>> E.g., load the "flat" data into something like MySQL, then use a
>> programming language and SQL code to produce de-normalized XML.  (I'm
>> thinking of extending xmlsh's xsql to handle master-detail queries to
>> do this.)
>>
>> 2) Load the flat data into a separate ML database (or directory), but
>> probably a totally separate DB which has its settings tuned for fastest
>> load (i.e. all the wildcard and stemmed searching turned off), then run
>> XQuery on this DB to write the de-normalized XML back to the
>> filesystem, then load that data into the target DB.  I have found I have
>> to do this iteratively because the resultant XML document is too big to
>> fit in memory, so XCC crashes out if I produce one big denormalized file.
>>
>>
>>
>> Any other suggestions on how people de-normalize flat data to load into
>> ML?
>>
>> Thanks for any suggestions.
>>
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]] On Behalf Of Danny
>> Sokolsky
>> Sent: Saturday, December 12, 2009 6:55 PM
>> To: General Mark Logic Developer Discussion
>> Subject: [MarkLogic Dev General] RE: Help with co-occurance (?) xquery
>>
>> Hi David,
>>
>> I am not sure I understand what you are trying to do, but here are a few
>> ideas:
>>
>> * consider creating a range index on the CONCEPTID element.  Then you
>> can use cts:element-values to get all of the unique CONCEPTID values
>> very quickly (and you can use the cts:query parameter to constrain it to
>> a cts:query--like a cts:directory-query)
>>
>> * the predicate [CONCEPTID2 = ($d1,$d2)] will return true if *either*
>> value is there; I think (not sure) that you wanted both to be there.
>>
>> * if you know the full path to your elements, that is preferable to
>> using //
>>
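
The "either" versus "both" point can be seen on a toy example in plain XQuery. The IDs below are shortened, hypothetical stand-ins rather than real concept IDs, and the $rels variable is only for illustration:

```xquery
xquery version "1.0";

(: Hypothetical, shortened IDs for illustration only. :)
let $rels := (
  <RELATIONSHIP><CONCEPTID1>10</CONCEPTID1><CONCEPTID2>1</CONCEPTID2></RELATIONSHIP>,
  <RELATIONSHIP><CONCEPTID1>10</CONCEPTID1><CONCEPTID2>2</CONCEPTID2></RELATIONSHIP>,
  <RELATIONSHIP><CONCEPTID1>20</CONCEPTID1><CONCEPTID2>1</CONCEPTID2></RELATIONSHIP>
)
let $d1 := 1, $d2 := 2
return (
  (: "either": the general comparison matches all three records :)
  count($rels[CONCEPTID2 = ($d1, $d2)]),
  (: "both": CONCEPTID1 values that have a record for $d1 and one for $d2 :)
  distinct-values(
    $rels[CONCEPTID2 = $d1]/CONCEPTID1[. = $rels[CONCEPTID2 = $d2]/CONCEPTID1])
)
```

Here the first expression counts 3 records, while the second yields only 10, the one CONCEPTID1 that is related to both IDs.
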
>> -Danny
>> ________________________________________
>> From: [email protected]
>> [[email protected]] On Behalf Of Lee, David
>> [[email protected]]
>> Sent: Friday, December 11, 2009 6:30 PM
>> To: [email protected]
>> Subject: [MarkLogic Dev General] Help with co-occurance (?) xquery
>>
>> If anyone has any suggestions for this I'd love to hear them.
>>
>> I have a bunch of records (500k+) of elements like this:
>>
>> <RELATIONSHIP>
>>    <RELATIONSHIPID>1149315021</RELATIONSHIPID>
>>    <CONCEPTID1>7826003</CONCEPTID1>
>>    <RELATIONSHIPTYPE>246456000</RELATIONSHIPTYPE>
>>    <CONCEPTID2>288526004</CONCEPTID2>
>>    <CHARACTERISTICTYPE>1</CHARACTERISTICTYPE>
>>    <REFINABILITY>2</REFINABILITY>
>>    <RELATIONSHIPGROUP>0</RELATIONSHIPGROUP>
>>  </RELATIONSHIP>
>>
>>
>> Given 2 concept IDs, I want to query for RELATIONSHIP records where
>> CONCEPTID1 is the same, and CONCEPTID2 matches the 2 IDs I have.
>>
>> I also have a 'master' record set of <CONCEPT> which lists all the
>> concept IDs, if that helps.
>> I'm trying this naive XQuery ... which hasn't completed yet:
>>
>> let $d1 := 387494007,
>>    $d2 := 387458008
>>
>> for $c in xdmp:directory("/SNOMED/concepts/")//CONCEPT
>> let $cid := $c/CONCEPTID/string()
>> return
>>   xdmp:directory("/SNOMED/relationships/")//RELATIONSHIP[CONCEPTID1 eq
>> $cid][CONCEPTID2 = ($d1,$d2)]
>>
>>
>> I don't think it's quite what I'm looking for, but it's close.
>> The problem is that 10 minutes later it hasn't returned yet.
>> I'm sure it's not using any kind of indexing, which is going to be way too
>> slow.
>> I'm looking at cts:element-value-co-occurrences,
>> which seems to be very close to what I want, but these 2 relationship
>> elements don't co-occur within the same fragment.
>> My next thought is to regenerate the data, putting all relationships
>> with a given CONCEPTID1 inside the CONCEPT element associated with it.
>> That's probably a better design anyway ...
>> but any suggestions on how to query this in a different way are very
>> welcome.
>>
>> ----------------------------------------
>> David A. Lee
>> Senior Principal Software Engineer
>> Epocrates, Inc.
>> [email protected]<mailto:[email protected]>
>> 812-482-5224
>>
>>
>> _______________________________________________
>> General mailing list
>> [email protected]
>> http://xqzone.com/mailman/listinfo/general
>



-- 
Andrew Welch
http://andrewjwelch.com
Kernow: http://kernowforsaxon.sf.net/