2009/12/13 Jason Hunter <[email protected]>:
> Here's an idea.
>
> Make <item> into its own fragment.  Put a range index on <a>.  Run two
> cts:element-values() calls and to each pass a cts:element-value-query()
> limiting <b> to values 1 (in the first) and 2 (for the second).  Use
> "intersect" on the two returned lists to find the set of <a> values that have
> <b> values of both 1 and 2.  I think this will work if you know in advance
> the values 1 and 2 that you're trying to match against.
>
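A minimal sketch of the two-call idea quoted above, not a tested implementation. It assumes each <item> is its own fragment (for example, a fragment root on item) and that a string-typed element range index exists on <a>; neither setting is confirmed in the thread. One detail worth noting: the XQuery "intersect" operator works on nodes, while cts:element-values() returns atomic values, so the sketch compares by value instead.

```xquery
xquery version "1.0-ml";

(: Assumption: each <item> is a separate fragment and <a> has a
   string element range index. The values "1" and "2" are the two
   <b> values we want to see paired with the same <a>. :)
let $value1 := "1"
let $value2 := "2"

(: <a> values drawn from fragments whose <b> equals $value1 :)
let $a1 := cts:element-values(xs:QName("a"), (), (),
             cts:element-value-query(xs:QName("b"), $value1))

(: <a> values drawn from fragments whose <b> equals $value2 :)
let $a2 := cts:element-values(xs:QName("a"), (), (),
             cts:element-value-query(xs:QName("b"), $value2))

(: Value-based intersection: keep the <a> values present in both lists :)
return $a1[. = $a2]
```

Because the filter compares values rather than node identity, two separate <item> elements that share the same <a> value still count as a match. On the eight-item sample from David's message this should yield the <a> values 1 and 4.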
Hi Jason,

If the input is like:

<item><a>1</a><b>1</b></item>
<item><a>1</a><b>2</b></item>

then intersect won't help, as it would need to be the same <a> element
rather than two elements with the same name and value, if I've
understood your suggestion correctly?

cheers
andrew

> If that does the job for you it has the performance advantage of not needing
> to read any fragments off disk.
>
> Depending on what you know about your data and if it's fully normalized, you
> might be able to do it with one cts:element-values() call, passing a
> cts:or-query() allowing either <b> value, and look at cts:frequency on each
> <a> result to find hits that have a frequency of 2.
>
> -jh-
>
> On Dec 13, 2009, at 6:03 AM, Lee, David wrote:
>
>> Thanks for the suggestions.  I've solved the problem by re-organizing my
>> XML.
>> But I think the original problem is still "academically" interesting, so
>> let me restate it in a simpler form.
>>
>> Suppose I have a document with a list of elements like
>>
>> <item><a>1</a><b>1</b></item>
>> <item><a>1</a><b>2</b></item>
>> <item><a>2</a><b>3</b></item>
>> <item><a>2</a><b>4</b></item>
>> <item><a>3</a><b>5</b></item>
>> <item><a>4</a><b>1</b></item>
>> <item><a>4</a><b>7</b></item>
>> <item><a>4</a><b>2</b></item>
>>
>> To make sense of these, think of them as relation items.
>> They are saying, e.g.,
>>    a=1 is related to both b=1 and b=2
>> Another way to look at this data is as "property" data:
>>    item a=2 has 2 properties (3,4)
>>
>> My query is a co-occurrence-like query.
>> I want to find all <a> values where there exist 2 relations (or
>> properties) with specific values.
>>
>> In this case both <a> and <b> refer to elements in a different document
>> (linked via the <a> value).
>>
>> So I can't use the co-occurrence search in ML (it requires the docs to be in
>> the same fragment).
>> That leaves me with pure XQuery, where I'm forced to loop over all
>> values of a.
>>
>> Something like this.
>> (Haven't tried it, but it's close, I think.)
>>
>> for $a in fn:distinct-values( //item/a/string() )
>> let $as := //item[a eq $a]
>> where exists($as[b eq $value1]) and exists($as[b eq $value2])
>> return $a
>>
>> The problem is that this is extremely slow when I have 500,000 item
>> entries.
>> None of the suggestions solved the basic problem: I can't find
>> another approach to do a query which doesn't require a loop over all
>> values of <a>.
>> There is no (to my finding)
>>    cts:search( //item ,
>>        magic search that returns matches where a is the same and b
>>        matches both $value1 and $value2 )
>>
>> The final answer?
>> What I should have done in the first place: DE-normalize the data.
>> As a second pass at data loading, I've put all of these relations
>> *within* the master element,
>> as opposed to a separate relations document.  This looks something like
>>
>> <element>
>>    <id>1</id>
>>    <value1> ... </value1>
>>    ...
>>    <items>
>>       <item><a>1</a><b>1</b></item>
>>       <item><a>1</a><b>2</b></item>
>>    </items>
>> </element>
>>
>> It made the data size a bit bigger, because in reality these are
>> relations, not properties,
>> so I had to include in elements id=x all items where either a=x or b=x,
>> so potentially it doubled the number of <item> elements overall.
>> But the end result is much easier to query.
>>
>> Now the query is something ML can optimize (and does very well):
>>
>> //element[items/item/b = $value1 and items/item/b = $value2]
>>
>> Or this can be reformulated into a cts:search easily (with an and
>> query), which is what I did.
>>
>> So the conclusion?
>> What I'm finding (and what was suggested earlier in another question)
>> is that when given normalized relational-type data, it's often best to
>> DE-normalize the data before loading into ML.
>> The problem I have is that the raw data is pretty big and doesn't fit
>> into memory,
>> so I need a database to load the raw data in order to de-normalize it!!!
>> And even when it does fit into memory, memory based xquery code handles
>> it pretty badly because
>> there is no indexing, so it takes forever to run the denormalization.
>>
>> So I'm left with these 2 options ... other suggestions welcome.
>>
>> 1) Load the relational data into RDBMS
>> Eg. load the "flat" data into something like mysql.  Then use a
>> programming language and SQL code to produce de-normalized XML.  (I'm
>> thinking of extending xmlsh's xsql to handle master-detail queries to
>> do this).
>>
>> 2) Load the Flat data into a separate ML database (or directory), but
>> probably a totally separate DB which has its settings tuned for fastest
>> load (i.e. all the wildcard and stemmed searching turned off), then run
>> xquery on this DB to produce the de-normalized XML back to the
>> filesystem then load that data into the target DB.  I have found I have
>> to do this iteratively because the resultant XML document is too big to
>> fit in memory so XCC crashes out if I produce one big denormalized file.
>>
>> Any other suggestions on how people de-normalize flat data to load into
>> ML ?
>>
>> Thanks for any suggestions.
>>
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]] On Behalf Of Danny
>> Sokolsky
>> Sent: Saturday, December 12, 2009 6:55 PM
>> To: General Mark Logic Developer Discussion
>> Subject: [MarkLogic Dev General] RE: Help with co-occurance (?) xquery
>>
>> Hi David,
>>
>> I am not sure I understand what you are trying to do, but here are a few
>> ideas:
>>
>> * consider creating a range index on the CONCEPTID element.  Then you
>> can use cts:element-values to get all of the unique CONCEPTID values
>> very quickly (and you can use the cts:query parameter to constrain it to
>> a cts:query--like a cts:directory-query)
>>
>> * the predicate [CONCEPTID2 = ($d1,$d2)] will return true if *either*
>> value is there, I think (not sure) that you wanted both to be there.
>>
>> * if you know the full path to your elements, that is preferable to
>> using //
>>
>> -Danny
>> ________________________________________
>> From: [email protected]
>> [[email protected]] On Behalf Of Lee, David
>> [[email protected]]
>> Sent: Friday, December 11, 2009 6:30 PM
>> To: [email protected]
>> Subject: [MarkLogic Dev General] Help with co-occurance (?) xquery
>>
>> If anyone has any suggestions for this I'd love to hear them.
>>
>> I have a bunch of records (500k+) of elements like this:
>>
>> <RELATIONSHIP>
>>    <RELATIONSHIPID>1149315021</RELATIONSHIPID>
>>    <CONCEPTID1>7826003</CONCEPTID1>
>>    <RELATIONSHIPTYPE>246456000</RELATIONSHIPTYPE>
>>    <CONCEPTID2>288526004</CONCEPTID2>
>>    <CHARACTERISTICTYPE>1</CHARACTERISTICTYPE>
>>    <REFINABILITY>2</REFINABILITY>
>>    <RELATIONSHIPGROUP>0</RELATIONSHIPGROUP>
>> </RELATIONSHIP>
>>
>> Given 2 concept IDs, I want to query for RELATIONSHIP records where
>> CONCEPTID1 is the same, and CONCEPTID2 matches the 2 IDs I have.
>>
>> I also have a 'master' record set of <CONCEPT> which lists all the
>> concept IDs, if that helps.
>> I'm trying this naive xquery ... which hasn't completed yet:
>>
>> let $d1 := 387494007,
>>     $d2 := 387458008
>>
>> for $c in xdmp:directory("/SNOMED/concepts/")//CONCEPT
>> let $cid := $c/CONCEPTID/string()
>> return
>>    xdmp:directory("/SNOMED/relationships/")//RELATIONSHIP[CONCEPTID1 eq $cid][CONCEPTID2 = ($d1,$d2)]
>>
>> I don't think it's quite what I'm looking for, but it's close.
>> Problem is, 10 minutes later it hasn't returned yet.
>> I'm sure it's not using any kind of indexing, which is going to be way too
>> slow.
>> I'm looking at cts:element-value-co-occurrences,
>> which seems to be very close to what I want, but these 2 relationship
>> elements don't co-occur within the same fragment.
>> My next thought is to regenerate the data, putting each relationship
>> inside the CONCEPT element that its CONCEPTID1 refers to.
>> That's probably a better design anyway ...
>> but any suggestions on how to query this in a different way very
>> welcome.
>>
>> ----------------------------------------
>> David A. Lee
>> Senior Principal Software Engineer
>> Epocrates, Inc.
>> [email protected]
>> 812-482-5224
>>
>> _______________________________________________
>> General mailing list
>> [email protected]
>> http://xqzone.com/mailman/listinfo/general

--
Andrew Welch
http://andrewjwelch.com
Kernow: http://kernowforsaxon.sf.net/
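
For reference, the cts:search reformulation that David mentions for the de-normalized layout (an and-query over the two <b> values) might look roughly like the sketch below. This is an illustration, not David's actual query: it assumes each <element> sits in its own document or fragment, that <b> only occurs under items/item, and that the values are compared as strings.

```xquery
xquery version "1.0-ml";

(: Assumption: each <element> (with its nested <items>) is stored as
   its own document or fragment, as in the de-normalized layout
   described in the thread. :)
let $value1 := "1"
let $value2 := "2"
return
  cts:search(//element,
    cts:and-query((
      (: both <b> values must occur within the same <element> :)
      cts:element-value-query(xs:QName("b"), $value1),
      cts:element-value-query(xs:QName("b"), $value2)
    )))
```

The and-query lets the server resolve both conditions from its indexes per fragment, which is why nesting the relations inside the master element avoids the loop over every <a> value.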
