>> Option 2: Mark up the documents with just the (assumed unique) leaf node
>> values. Maintain a separate declarative document with the hierarchy
>> showing how the leaf node values fit together.
>>
> Is the idea here to store the entire hierarchy in a single document, or
> across multiple documents (e.g. one document for each node and its direct
> children)? In some cases the hierarchy is actually a directed acyclic graph,
> so it doesn't fit neatly into the XML document model.
You could do it by representing the edges. But I bet you'll be better served
with Option 3 from your reply.
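
To make the edge idea concrete, here is a minimal sketch. It is in Python rather than XQuery only so that it runs without a server. The edge list, the "Marine animal" term and the ancestors() helper are all made up for illustration. The point is just that a term with two parents is unremarkable once you store child-to-parent edges instead of nesting.

```python
from collections import defaultdict

# Hypothetical child -> parent edges. "Seasnake" has two parents, so this is
# a DAG rather than a tree. "Marine animal" is invented for the example.
EDGES = [
    ("Reptile", "Animal"),
    ("Snake", "Reptile"),
    ("Cobra", "Snake"),
    ("Seasnake", "Snake"),
    ("Marine animal", "Animal"),
    ("Seasnake", "Marine animal"),
]

# Index the edges by child so that parents can be looked up directly.
PARENTS = defaultdict(set)
for child, parent in EDGES:
    PARENTS[child].add(parent)


def ancestors(term):
    """Return every term reachable from `term` by following parent edges."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in PARENTS.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(ancestors("Seasnake")))
```

In a document store the same edges could live as one small element per edge, which is why the DAG case does not need the XML nesting to mirror the hierarchy.
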
>> Perhaps that's more
>> useful. You'll do your query and quickly fetch all the leaf node
>> values, and when you want to show facets above the leaf nodes just
>> do some coalescing math. The performance should be good.
>>
> Yes, this sounds right. My assumption, however, has been that the coalescing
> operations would be expensive if you have to do 100s of them in order to
> generate a search result page. I saw enough good stuff at the MLUC that I'm
> questioning that assumption.
The coalescing is just simple math. Computers fortunately excel at that. :)
>> As an example, if you're modeling a biological taxonomy, you can quickly
>> find the distinct number and count of animals matching any query, and then
>> if you want to show mammals vs reptiles you walk the list of distinct
>> animal matches and use your declarative document to figure out how many you
>> have of each. Use the MarkLogic "map" API and I expect this will be very
>> fast even for thousands of distinct animals which is probably more than
>> you have in your case.
>>
> Yes, we are working with large-scale biomedical taxonomies that can have > 1M terms in
> them. We're willing to put this on big hardware, so in-memory mapping sounds
> like a good option. Just to make sure we're talking about the same thing, I'm
> envisioning a model where a 1M document collection for veterinary medicine
> would be tagged by animal and by disease. If a user searches for "Snake" and
> "Infection", where "Infection" is marked as a synonym of infectious disease
> (it isn't really a synonym, but ignore that), then we would show all matching
> documents along with a faceted browsing UI that looks something like:
>
> - Animal (1,000,000)
>   - Reptile (10,394)
>     * Snake (3,040)
>       - Cobra (1,000)
>       - Viper (1,500)
>       - Seasnake (500)
>
> - Disease (1,000,000)
>   * Infectious Disease (250,000)
>     - Viral infection (111,000)
>     - Bacterial infection (139,000)
So let's represent the data as strings like this:

    Animal:Reptile:Snake:Cobra
    Animal:Reptile:Snake:Viper
    Animal:Reptile:Snake:Seasnake
    Disease:Infectious Disease:Viral infection
    Disease:Infectious Disease:Bacterial infection

You can put these in <taxonomy> elements, or in separate <animal-taxonomy> and
<disease-taxonomy> elements, depending on how many taxonomies you want per doc.
For a given query you can use lexicons (cts:element-values(), then
cts:frequency() on each value it returns) to get a list of taxonomic strings
and their counts in the result set. The lexicon call needs an element range
index on the element. It'll give you data like this:

    Animal:Reptile:Snake:Cobra                          1,000
    Animal:Reptile:Snake:Viper                          1,500
    Animal:Reptile:Snake:Seasnake                         500
    Disease:Infectious Disease:Viral infection        111,000
    Disease:Infectious Disease:Bacterial infection    139,000

MarkLogic will do that for you, and do it quickly against a large data set.
You can also choose to limit it by any ad hoc cts:query. Your job is then to
transform this flat list into the hierarchical representation you showed
above. There's no heavy lifting there, just basic grouping. XSLT people might
even tell you there are tricks in XSLT to do the grouping. I'd personally
just do a loop with a map:map() maintained for each level. I could even write
that loop code for you if you get stuck.
Even with 50 child terms, that will run very quickly. How many map operations
can you do per second? Lots more than 50.
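
A rough sketch of that grouping loop, in Python rather than XQuery so it runs anywhere. It is not the MarkLogic code itself. For brevity it uses a single dict keyed by path prefix where I described one map per level. LEXICON stands in for what the lexicon call would hand back, and roll_up() is a name I made up. It also assumes each document carries only one value per taxonomy. If a doc is tagged with both Cobra and Viper, summing the counts will overcount the parents.

```python
from collections import defaultdict

# (taxonomy string, document count) pairs, standing in for the output of the
# lexicon call. The numbers are the ones from the example in this thread.
LEXICON = [
    ("Animal:Reptile:Snake:Cobra", 1000),
    ("Animal:Reptile:Snake:Viper", 1500),
    ("Animal:Reptile:Snake:Seasnake", 500),
    ("Disease:Infectious Disease:Viral infection", 111000),
    ("Disease:Infectious Disease:Bacterial infection", 139000),
]


def roll_up(pairs, sep=":"):
    """Add each value's count to every prefix of its path."""
    totals = defaultdict(int)
    for value, count in pairs:
        parts = value.split(sep)
        # "A:B:C" contributes to "A", "A:B" and "A:B:C".
        for depth in range(1, len(parts) + 1):
            totals[sep.join(parts[:depth])] += count
    return dict(totals)


totals = roll_up(LEXICON)
for path in sorted(totals):
    indent = "  " * path.count(":")
    print(f"{indent}{path.rsplit(':', 1)[-1]} ({totals[path]:,})")
```

That is one dict update per path segment per distinct value, so the cost grows with the number of distinct values in the result set, not with the number of documents.
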
All that's about efficiently showing the hierarchy against either the full
corpus or a subset defined by a query.
>> If you want to limit a query to a certain parent node (i.e. reptiles), you'd
>> use an or-query for
>> the leaf nodes. That's how the thesaurus works in essence.
>>
> This is where I get stuck. I think I need 3 data structures, and I can't tell
> which ones ML will give me. I need
>
> 1. a way to mark up documents with leaf terms.
> 2. a way to map leaf terms to parent terms.
> 3. a way to identify documents that match a given term or its children.
>
> So in the above example, to calculate the 3,040 count of documents matching
> Snake, I need to find all documents that are tagged Snake, Cobra, Viper or
> Seasnake. So first I need to find the children of Snake, then find all
> documents that mention any of those children, and then calculate the final
> aggregate. Note that in the above example, Cobra + Viper + Seasnake counts =
> 3,000, implying that there are another 40 documents that mention "Snake"
> directly.
The easiest way is to limit your query to documents that have
"Animal:Reptile:Snake" (the path down to Snake in the strings above) as a
phrase in the taxonomy element. That matches documents tagged with Snake
itself as well as any of its descendants. Use the full text indexes to your
advantage:

    cts:element-word-query(xs:QName("taxonomy"), "Animal:Reptile:Snake")

Make sure you have element word indexes turned on, of course.
> Further, given that I also want to display summary counts for the immediate
> children of Snake, I could re-do this for Cobra, for Viper, etc. Alternately,
> since those aggregates had to be calculated along the way to calculating the
> 3,040 count for Snake, maybe there is a way to collect them along the way and
> avoid recalculation.
Right, you want to skip the recalculation and do a single lexicon call,
passing in a cts:query that says the taxonomy has to contain the phrase
Animal:Reptile:Snake. Example:

    let $query :=
      cts:element-word-query(xs:QName("taxonomy"), "animal:reptile:snake")
    for $val in cts:element-values(xs:QName("taxonomy"), "", (), $query)
    let $count := cts:frequency($val)
    return ($val, $count)

You make a query limiting to a subset of the taxonomy. You fetch all taxonomic
values within that subset and get their counts. Then from that you can build
the hierarchy using maps. Here I just return a list of values and counts.
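
To show how the 3,040 falls out of that list without any recalculation, here is a small Python sketch of the arithmetic (again not XQuery). RESULT is hypothetical output from the query above. It assumes the 40 documents that mention Snake directly are tagged with the string "Animal:Reptile:Snake", and that no document carries two values under Snake. children_of() is an invented helper name.

```python
def children_of(pairs, node, sep=":"):
    """Return (total, direct, per-child counts) for one taxonomy node."""
    prefix = node + sep
    direct = 0
    children = {}
    for value, count in pairs:
        if value == node:
            # Documents tagged with the node itself, not with a descendant.
            direct += count
        elif value.startswith(prefix):
            # Credit the count to the immediate child on the path.
            child = value[len(prefix):].split(sep)[0]
            children[child] = children.get(child, 0) + count
    return direct + sum(children.values()), direct, children


# Hypothetical (value, count) pairs returned for the Snake-limited query.
RESULT = [
    ("Animal:Reptile:Snake", 40),
    ("Animal:Reptile:Snake:Cobra", 1000),
    ("Animal:Reptile:Snake:Viper", 1500),
    ("Animal:Reptile:Snake:Seasnake", 500),
]

total, direct, children = children_of(RESULT, "Animal:Reptile:Snake")
print(total, direct, children)
```

One pass over the list gives you the Snake total and the counts for Cobra, Viper and Seasnake together, so the child facets come for free.
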
Am I on the right track?
-jh-