David,
I'm glad I asked why you're searching two documents. :)
If you have 20,000 top level elements in a large document, and you also wish to
return those elements as results, then to me it seems like you probably want to
load them as individual documents. (To me your "document" sounds more like an
archive than a document.) There are several reasons for this.
1) the performance of queries will be better
2) relevance is determined at the document level, so all the matching nodes in
your large document will have the same score
3) you may be able to use the Search API to do most of what you're trying to do
If you follow this approach, you could group all these documents in a common
collection or directory. Or, perhaps they all have a common root node. Keep in
mind if you do:
cts:search(/my-root-node,$query)
You will only search documents with that root node. No need to specify a
specific collection or document name. You can also do:
cts:search(collection("my-collection")/my-root-node,$query)
or
cts:search(xdmp:directory("/my-directory/","infinity")/my-root-node,$query)
In both of these approaches, the collection and directory constraints will be
combined with the root node constraint.
As Geert says, you can use cts:collection-query() and cts:directory-query()
instead, if you want to group all of your constraints in a cts:query, which I
think is generally a best practice. However, it is a little easier to explain
the XPath part of the constraints with the syntax above. :)
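For illustration, the cts:query form might look something like the sketch
below. The collection name, the root node, and the search word are just the
placeholders from the examples above, not your actual values, so treat this as
a starting point rather than a finished query.

```xquery
xquery version "1.0-ml";

(: Sketch only: "my-collection", /my-root-node and the word "Codeine"
   are placeholders, not values from a real database. :)
let $query :=
  cts:and-query((
    (: restrict the search to documents in one collection :)
    cts:collection-query("my-collection"),
    (: the actual search term :)
    cts:word-query("Codeine")
  ))
return cts:search(/my-root-node, $query)
```

To constrain by directory instead, you could swap the collection query for
cts:directory-query("/my-directory/", "infinity").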
To load these individual nodes as documents, you should take a look at
RecordLoader - that's probably the easiest and most efficient way.
Now, there is a concept in MarkLogic called fragmentation which allows you to
store very large documents, and to perform minimal disk IO when retrieving or
updating the individual fragments. This is a very useful feature. However, for
search applications, the best practice is to load the individual nodes as
documents. If there is metadata that applies to all your individual nodes, then
we can talk about how you might deal with that.
I think one lesson here is that it is a good idea to step back and talk about
the bigger picture - what are you trying to do in your application? This will
help us to recommend the best approach. Right now it still seems mysterious. :)
Kelly
Thanks for the suggestion.
Collections are probably what I want to use, but I'm just experimenting to try
to learn.
In my case I have 2 documents (both large, about 500MB each, with about 20,000
top level elements each) where I want to do a single search and produce a
single set of results ordered by score.
This will lead to a page which shows "Top 10 results" where if you click on
the link it will go to the detail for that record. These documents are from
the same category of data, but have totally different schemas. They happen
right now to reside in a directory with 3 other documents I do NOT want to
search. So my choices are:
1) Move these 2 into their own private directory
2) Create a collection with these 2 docs (learning how to do that now, thanks
to the group !)
3) Search the 2 docs separately and try to combine the results (very ugly, all
sorts of extra work needed to combine the results)
4) Search the 2 documents in ONE search. Thanks to the suggestion of
using multiple documents to fn:doc() I found this search works *exactly how I
want*
The trick here is that I want a particular sub-record returned as the result,
not the entire document, so hence the funky xpath expression after doc() to
pick the 2 kinds of sub record sets I want searched.
cts:search(
fn:doc(("/Mesh/desc2010.xml","/Mesh/supp2010.xml"))/
(DescriptorRecordSet/DescriptorRecord|SupplementalRecordSet/SupplementalRecord),
"Codeine")
This Works !!!!!
And it beats the hell out of my option 3 solution, which is
cts:search( doc("doc1") ) | cts:search(doc("doc2")) ...
because it seems like now the results are combined, which means
remainder() should work on the combined set and I don't have to re-order the
results!
I will definitely look into making a collection and/or separating these
2 docs into their own directory. But good to know I can do this way too.
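As a rough sketch of the "Top 10 results" page, something along these lines
might work. It reuses the paths and search term from the query above; the
<result> wrapper element and its attributes are made up for illustration, and
it relies on cts:search returning hits in relevance order by default.

```xquery
xquery version "1.0-ml";

(: Sketch only: the <result> element is a hypothetical wrapper;
   the document URIs and element names come from the query above. :)
for $hit in
  cts:search(
    fn:doc(("/Mesh/desc2010.xml", "/Mesh/supp2010.xml"))/
      (DescriptorRecordSet/DescriptorRecord
       | SupplementalRecordSet/SupplementalRecord),
    "Codeine")[1 to 10]   (: keep only the first ten hits :)
return
  <result score="{cts:score($hit)}" uri="{xdmp:node-uri($hit)}">
    {fn:local-name($hit)}
  </result>
```

The uri attribute tells you which of the 2 documents a hit came from, which
could be used to build the link to the detail page.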