(this is a follow up to
http://developer.marklogic.com/pipermail/general/2010-May/005460.html. I
deleted the original replies, so am faking a reply in the hopes it will still
show up as a threaded response)
Hi Jason and Danny --
Thanks for the feedback. I think Jason's examples are more along the lines I'm
looking for.
> You have a few ways to do this, depending on your requirements.
> Option 2: Mark up the documents with just the (assumed unique) leaf node
> values. Maintain a separate declarative document with the hierarchy
> showing how the leaf node values fit together.
>
Is the idea here to store the entire hierarchy in a single document, or across
multiple documents (e.g. one document for each node and its direct children)?
In some cases the hierarchy is actually a directed acyclic graph, so it doesn't
fit neatly into the XML document model.
> Perhaps that's more
> useful. You'll do your query and quickly fetch all the leaf node
> values, and when you want to show facets above the leaf nodes just
> do some coalescing math. The performance should be good.
>
Yes, this sounds right. My assumption, however, has been that the coalescing
operations would be expensive if you have to do 100s of them in order to
generate a search result page. I saw enough good stuff at the MLUC that I'm
questioning that assumption.
> As an example, if you're modeling a biological taxonomy, you can quickly
> find the distinct number and count of animals matching any query, and then
> if you want to show mammals vs reptiles you walk the list of distinct
> animal matches and use your declarative document to figure out how many you
> have of each. Use the MarkLogic "map" API and I expect this will be very
> fast even for thousands of distinct animals which is probably more than
> you have in your case.
>
Yes, we are working with large-scale biomedical taxonomies that can have > 1M
terms in them. We're willing to put this on big hardware, so in-memory mapping
sounds like a good option. Just to make sure we're talking about the same
thing, I'm envisioning a model where a 1M document collection for veterinary
medicine would be tagged by animal and by disease. If a user searches for
"Snake" and "Infection", where "Infection" is marked as a synonym of infectious
disease (it isn't really a synonym, but ignore that), then we would show all
matching documents along with a faceted browsing UI that looks something like:
- Animal (1,000,000)
  - Reptile (10,394)
    * Snake (3,040)
      - Cobra (1,000)
      - Viper (1,500)
      - Seasnake (500)
- Disease (1,000,000)
  * Infectious Disease (250,000)
    - Viral infection (111,000)
    - Bacterial infection (139,000)
The * indicates where the user's search terms are in the hierarchy. This does
"coalescing" both above the term (e.g. shows the total number of Reptile
documents above the Snake entry), as well as immediately below the term (e.g.
shows the number of documents for each type of Snake). There might be 50 such
child terms, so it seems to me that generating a small UI fragment like this
would potentially involve 100 aggregation operations. Am I thinking about this
right, based on the Option 2 example?
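
To make the question concrete, here is a rough sketch of the "coalescing math"
as I understand it. It is plain Python, not the MarkLogic map API, and the
names PARENTS, ancestors and roll_up are made up for illustration. It assumes
the query has already returned a count per directly tagged term and that each
document carries a single term per facet, so counts can simply be summed. Under
those assumptions, one walk over the distinct matched terms fills in every
ancestor total instead of running one aggregation per displayed node:

```python
from collections import defaultdict

# Hypothetical taxonomy fragment: term -> list of direct parents.
# A list is used so a term may have several parents (a DAG).
PARENTS = {
    "Reptile": ["Animal"],
    "Snake": ["Reptile"],
    "Cobra": ["Snake"],
    "Viper": ["Snake"],
    "Seasnake": ["Snake"],
}

def ancestors(term, parents, cache):
    """Return the set of all ancestors of term, memoized in cache."""
    if term not in cache:
        found = set()
        for parent in parents.get(term, []):
            found.add(parent)
            found |= ancestors(parent, parents, cache)
        cache[term] = found
    return cache[term]

def roll_up(tagged_counts, parents):
    """Add each term's own count to itself and once to every ancestor."""
    totals = defaultdict(int)
    cache = {}
    for term, count in tagged_counts.items():
        totals[term] += count
        for ancestor in ancestors(term, parents, cache):
            totals[ancestor] += count
    return dict(totals)

# Counts per directly tagged term, taken from the Snake example above.
tagged = {"Snake": 40, "Cobra": 1000, "Viper": 1500, "Seasnake": 500}
totals = roll_up(tagged, PARENTS)
print(totals["Snake"])  # 3040
```

Because the ancestors of a term are collected as a set, a term reachable from
the same ancestor by two paths is still added only once. If a document can
carry several terms from the same facet, summing counts would over-count and
the aggregation would have to work on document sets instead.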
> If you want to limit a query to a certain parent node (i.e. reptiles), you'd
> use an or-query for
> the leaf nodes. That's how the thesaurus works in essence.
>
This is where I get stuck. I think I need 3 data structures, and I can't tell
which ones ML will give me. I need
1. a way to mark up documents with leaf terms.
2. a way to map leaf terms to parent terms.
3. a way to identify documents that match a given term or its children.
So in the above example, to calculate the 3,040 count of documents matching
Snake, I need to find all documents that are tagged Snake, Cobra, Viper or
Seasnake. So first I need to find the children of Snake, then find all
documents that mention any of those children, and then calculate the final
aggregate. Note that in the above example, Cobra + Viper + Seasnake counts =
3,000, implying that there are another 40 documents that mention "Snake"
directly.
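
Spelled out as a sketch, the three structures and the Snake calculation might
look like the following. This is plain Python with invented document IDs and
helper names (DOC_TERMS, CHILDREN, build_term_index, matching_docs), not
anything MarkLogic provides; it only illustrates the shape of the lookup. The
final union plays the role of the or-query over the term and its descendants:

```python
from collections import defaultdict

# 1. Documents marked up with terms (hypothetical document IDs and tags).
DOC_TERMS = {
    "doc1": {"Snake"},
    "doc2": {"Cobra"},
    "doc3": {"Viper"},
    "doc4": {"Cobra", "Viper"},  # tagged twice, must be counted once
    "doc5": {"Lizard"},
}

# 2. Taxonomy: term -> direct children (hypothetical fragment).
CHILDREN = {
    "Reptile": ["Snake", "Lizard"],
    "Snake": ["Cobra", "Viper", "Seasnake"],
}

def build_term_index(doc_terms):
    """3. Inverted index: term -> set of documents tagged with it."""
    index = defaultdict(set)
    for doc_id, terms in doc_terms.items():
        for term in terms:
            index[term].add(doc_id)
    return index

def term_and_descendants(term, children):
    """Return the term plus everything below it; safe for DAGs."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(children.get(current, []))
    return seen

def matching_docs(term, children, index):
    """Union of documents tagged with the term or any of its descendants."""
    docs = set()
    for t in term_and_descendants(term, children):
        docs |= index.get(t, set())
    return docs

INDEX = build_term_index(DOC_TERMS)
snake_docs = matching_docs("Snake", CHILDREN, INDEX)
print(len(snake_docs))  # 4
```

Working with sets of document IDs rather than counts means a document tagged
with both Cobra and Viper is counted once under Snake.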
Further, given that I also want to display summary counts for the immediate
children of Snake, I could re-do this for Cobra, for Viper, etc. Alternately,
since those aggregates had to be calculated along the way to calculating the
3,040 count for Snake, maybe there is a way to collect them along the way and
avoid recalculation.
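
One way to picture "collecting them along the way" is a memoized walk of the
taxonomy, sketched here in plain Python. The data and the names TERM_DOCS,
subtree_docs and facet_counts are hypothetical, and I am assuming the set of
directly tagged documents per term is already available from the query. Each
term's aggregate is computed once, stored, and reused both for its parent's
total and for its own line in the UI:

```python
# Hypothetical data: term -> documents tagged directly with that term.
TERM_DOCS = {
    "Snake": {"doc1"},
    "Cobra": {"doc2", "doc4"},
    "Viper": {"doc3", "doc4"},
    "Seasnake": set(),
}

# Hypothetical taxonomy fragment: term -> direct children.
CHILDREN = {"Snake": ["Cobra", "Viper", "Seasnake"]}

def subtree_docs(term, children, term_docs, memo):
    """Documents tagged with term or anything below it, computed once per term."""
    if term not in memo:
        docs = set(term_docs.get(term, ()))
        for child in children.get(term, []):
            docs |= subtree_docs(child, children, term_docs, memo)
        memo[term] = frozenset(docs)
    return memo[term]

def facet_counts(term, children, term_docs):
    """Counts for term and each direct child from a single traversal."""
    memo = {}
    counts = {term: len(subtree_docs(term, children, term_docs, memo))}
    for child in children.get(term, []):
        # Already computed while aggregating the parent; no second pass.
        counts[child] = len(memo[child])
    return counts

counts = facet_counts("Snake", CHILDREN, TERM_DOCS)
print(counts)
```

The memo also keeps a term with several parents from being recomputed. For a
very deep taxonomy the recursion would need to be replaced with an explicit
stack, and holding a document set per term may be too much memory at 1M terms,
so treat this as the shape of the idea rather than a scalable implementation.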
Is there sample code available for doing this sort of pseudo-recursive
calculation efficiently, in particular showing how to do efficient lookups
that involve calculations across 2 document sets (the actual documents and the
taxonomy document(s))?
> Option 3: Put the taxonomy hierarchy into a single string. Perhaps you'd
> have "reptile/snake/cobra" or something. This is similar to the option
> above but bakes the hierarchy into the documents again which is mentally
> simpler perhaps and has some query performance perks. For any given query
> you can get the distinct list of matching strings and you can easily do
> the math (again probably using map) for how many results have values
> starting with reptile vs starting with mammal.
>
> You can also then really easily limit your query to "reptile" by using
> a word-query or range-query against this field. If terms repeat in
> different places you can use an initial anchor word and a phrase search
> to make sure you're left-anchored.
>
This sounds like it might be worth combining with Option 2 for the above use
cases.
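
For reference, my reading of the Option 3 math is roughly the following plain
Python sketch. The path strings, the counts and the prefix_counts helper are
made up, and it assumes the query returns each distinct path string with its
document count and that every document carries one path per facet:

```python
from collections import Counter

def prefix_counts(path_counts, sep="/"):
    """Credit each path's count to the path itself and every left-anchored prefix."""
    totals = Counter()
    for path, count in path_counts.items():
        parts = path.split(sep)
        for depth in range(1, len(parts) + 1):
            totals[sep.join(parts[:depth])] += count
    return totals

# Hypothetical distinct values and counts returned for one query.
matches = {
    "reptile/snake": 40,
    "reptile/snake/cobra": 1000,
    "reptile/snake/viper": 1500,
    "reptile/snake/seasnake": 500,
    "mammal/dog": 7,
}

totals = prefix_counts(matches)
print(totals["reptile/snake"])  # 3040
print(totals["mammal"])         # 7
```

Splitting on the separator keeps the match left-anchored, so a term that
repeats deeper in another branch is not credited to the wrong parent. A term
with more than one parent would need one path string per route through the
hierarchy, which is where the DAG case gets awkward.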
Ramon