RE: [MarkLogic Dev General] RE: Help with co-occurrence (?) xquery

Lee, David Sun, 13 Dec 2009 06:03:17 -0800

Thanks for the suggestions. I've solved the problem by reorganizing my
XML.
But I think the original problem is still "academically" interesting, so
let me restate it
in a simpler form.


Suppose I have a document with a list of elements like

<item><a>1</a><b>1</b></item>
<item><a>1</a><b>2</b></item>
<item><a>2</a><b>3</b></item>
<item><a>2</a><b>4</b></item>
<item><a>3</a><b>5</b></item>
<item><a>4</a><b>1</b></item>
<item><a>4</a><b>7</b></item>
<item><a>4</a><b>2</b></item>

To make sense of these, think of them as relation items.
They are saying, e.g.,
   a=1  is related to both b=1 and b=2
Another way to look at this data is as "property" data:
item a=2 has 2 properties (3,4)

My query is a co-occurrence-like query.
I want to find all <a> values for which there exist 2 relations (or
properties) with specific values.

In this case both <a> and <b> refer to elements in a different document
(linked via the <a> value).


So I can't use the co-occurrence search in ML (it requires the elements
to be in the same fragment).
That leaves me with pure XQuery, where I'm forced to loop over all
values of a.

Something like this.  (I haven't tried it, but it's close, I think.)

(: keep $a only if it is paired with both $value1 and $value2 :)
for $a in fn:distinct-values( //item/a/string() )
let $as := //item[a eq $a]
where  exists($as[b eq $value1]) and exists($as[b eq $value2])
return $a


The problem is that this is extremely slow when I have 500,000 item
entries.
None of the suggestions solved the basic problem: I can't find
another approach
that does the query without a loop over all values of <a>.
There is no (that I can find)
   cts:search( //item ,
       magic search that returns matches where a is the same and b
matches both $value1 and $value2 )
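
For reference, the same condition can also be phrased set-wise rather
than as an explicit loop over <a>. This is only a sketch in plain XQuery
over the sample data above; the <items> wrapper, the $doc variable and
the two sample values are assumptions for illustration, and nothing here
shows it would be any faster at 500,000 items without index support.

```xquery
xquery version "1.0";

(: Sample data from above, wrapped in a hypothetical <items> root. :)
declare variable $doc :=
  <items>
    <item><a>1</a><b>1</b></item>
    <item><a>1</a><b>2</b></item>
    <item><a>2</a><b>3</b></item>
    <item><a>2</a><b>4</b></item>
    <item><a>3</a><b>5</b></item>
    <item><a>4</a><b>1</b></item>
    <item><a>4</a><b>7</b></item>
    <item><a>4</a><b>2</b></item>
  </items>;

(: Illustrative values; in practice these come from the caller. :)
declare variable $value1 := "1";
declare variable $value2 := "2";

(: <a> nodes paired with each of the two <b> values. :)
let $with1 := $doc/item[b eq $value1]/a
let $with2 := $doc/item[b eq $value2]/a
(: Keep only the <a> values that appear in both sets. :)
return fn:distinct-values($with1[. = $with2])
```

With the sample data, the <a> values paired with both b=1 and b=2 are
1 and 4.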



The final answer?
What I should have done in the first place: DE-normalize the data.
As a second pass at data loading, I've put all of these relations
*within* the master element,
as opposed to a separate relations document. This looks something like

<element>
   <id>1</id>
   <value1> ... </value1>
   ...
   <items>
        <item><a>1</a><b>1</b></item>
        <item><a>1</a><b>2</b></item>
   </items>
</element>


It made the data size a bit bigger, because in reality these are
relations, not properties,
so in the element with id=x I had to include all items where either a=x
or b=x, which potentially doubled the number of <item> elements overall.
But the end result is much easier to query.



Now the query is something ML can optimize (and does very well):

    //element[items/item/b = $value1 and items/item/b = $value2]

Or this can easily be reformulated as a cts:search (with an and-query),
which is what I did.
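
A minimal sketch of that cts:search form, not necessarily the exact
query used. It assumes each <element> is its own document (or fragment)
and uses made-up sample values for $value1 and $value2. Note that the
and-query as written matches <b> anywhere under the element, not only
under items/item.

```xquery
xquery version "1.0-ml";

(: Illustrative values only. :)
let $value1 := "1"
let $value2 := "2"
return
  cts:search(
    //element,
    (: Both value queries must match within the same <element>. :)
    cts:and-query((
      cts:element-value-query(xs:QName("b"), $value1),
      cts:element-value-query(xs:QName("b"), $value2)
    ))
  )
```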




So the conclusion?
What I'm finding (and what was suggested earlier in another question)
is that when given normalized, relational-type data, it's often best to
DE-normalize the data
before loading into ML.
The problem I have is that the raw data is pretty big and doesn't fit
into memory,
so I need a database to load the raw data into in order to de-normalize
it!
And even when it does fit into memory, memory-based XQuery code handles
it pretty badly because
there is no indexing, so it takes forever to run the denormalization.

So I'm left with these 2 options ... other suggestions welcome.

1) Load the relational data into an RDBMS.
E.g. load the "flat" data into something like MySQL, then use a
programming language and SQL code to produce de-normalized XML.  (I'm
thinking of extending xmlsh's xsql to handle master-detail queries to
do this.)

2) Load the flat data into a separate ML database (or directory), but
probably a totally separate DB which has its settings tuned for fastest
load (i.e. all the wildcard and stemmed searching turned off), then run
XQuery on this DB to write the de-normalized XML back to the
filesystem, then load that data into the target DB.  I have found I have
to do this iteratively because the resultant XML document is too big to
fit in memory, so XCC crashes out if I produce one big denormalized file.
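
The iterative pass in option 2 could be sketched roughly as below, one
batch of ids per call. Everything specific here is an assumption for
illustration: the external $start and $size batch bounds, a range index
on <a> in the flat database, and taking the ids from <a> alone (the
master record set would be a more complete source, since an id that
only ever appears as <b> is missed here). The other master fields are
left out.

```xquery
xquery version "1.0-ml";

(: Hypothetical batch bounds, passed in by the client (e.g. over XCC). :)
declare variable $start as xs:integer external;
declare variable $size as xs:integer external;

(: Assumes a range index on <a>, so the distinct ids come from the
   index rather than from a scan of every item. :)
let $ids := fn:subsequence(cts:element-values(xs:QName("a")), $start, $size)
for $id in $ids
let $value := fn:string($id)
(: Every flat <item> that mentions this id on either side. :)
let $related :=
  cts:search(
    //item,
    cts:or-query((
      cts:element-value-query(xs:QName("a"), $value),
      cts:element-value-query(xs:QName("b"), $value)
    ))
  )
return
  <element>
    <id>{ $value }</id>
    <items>{ $related }</items>
  </element>
```

Calling this repeatedly with increasing $start keeps each result small
enough to write out and load separately, instead of building one big
denormalized file.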



Any other suggestions on how people de-normalize flat data to load into
ML ?

Thanks for any suggestions.












-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Danny
Sokolsky
Sent: Saturday, December 12, 2009 6:55 PM
To: General Mark Logic Developer Discussion
Subject: [MarkLogic Dev General] RE: Help with co-occurance (?) xquery

Hi David,

I am not sure I understand what you are trying to do, but here are a few
ideas:

* consider creating a range index on the CONCEPTID element.  Then you
can use cts:element-values to get all of the unique CONCEPTID values
very quickly (and you can use the cts:query parameter to constrain it to
a cts:query--like a cts:directory-query)

* the predicate [CONCEPTID2 = ($d1,$d2)] will return true if *either*
value is there; I think (not sure) that you wanted both to be there.

* if you know the full path to your elements, that is preferable to
using //

-Danny
________________________________________
From: [email protected]
[[email protected]] On Behalf Of Lee, David
[[email protected]]
Sent: Friday, December 11, 2009 6:30 PM
To: [email protected]
Subject: [MarkLogic Dev General] Help with co-occurance (?) xquery

If anyone has any suggestions for this I'd love to hear them.

I have a bunch of records (500k+) of elements like this:

<RELATIONSHIP>
    <RELATIONSHIPID>1149315021</RELATIONSHIPID>
    <CONCEPTID1>7826003</CONCEPTID1>
    <RELATIONSHIPTYPE>246456000</RELATIONSHIPTYPE>
    <CONCEPTID2>288526004</CONCEPTID2>
    <CHARACTERISTICTYPE>1</CHARACTERISTICTYPE>
    <REFINABILITY>2</REFINABILITY>
    <RELATIONSHIPGROUP>0</RELATIONSHIPGROUP>
  </RELATIONSHIP>


Given 2 concept IDs, I want to query for RELATIONSHIP records where
CONCEPTID1 is the same, and CONCEPTID2 matches the 2 IDs I have.

I also have a 'master' record set of <CONCEPT> which lists all the
concept IDs, if that helps.
I'm trying this naive XQuery ... which hasn't completed yet:

let $d1 := 387494007,
    $d2 := 387458008

for $c in xdmp:directory("/SNOMED/concepts/")//CONCEPT
let $cid := $c/CONCEPTID/string()
return
   xdmp:directory("/SNOMED/relationships/")//RELATIONSHIP[CONCEPTID1 eq
$cid][CONCEPTID2 = ($d1,$d2)]


I don't think it's quite what I'm looking for, but it's close.
The problem is that 10 minutes later it hasn't returned yet.
I'm sure it's not using any kind of indexing, which is going to be way
too slow.
I'm looking at cts:element-value-co-occurrences,
which seems to be very close to what I want, but these 2 relationship
elements don't co-occur within the same fragment.
My next thought is to regenerate the data, putting all relationships
for a given CONCEPTID1 inside the CONCEPT element which is associated
with it.
That's probably a better design anyway ...
but any suggestions on how to query this in a different way are very
welcome.







----------------------------------------
David A. Lee
Senior Principal Software Engineer
Epocrates, Inc.
[email protected]<mailto:[email protected]>
812-482-5224


_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general