I would use directories, because it makes the A vs B display easier and cheaper to implement. All you have to do is look at the document URI, which is automatically returned in search:search results. Also directories are naturally exclusive: a doc can only be in A or B, not both.
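
To make the URI check concrete, here is a minimal sketch in plain Python. It is not MarkLogic API code. The directory names `/part-a/` and `/part-b/` and both helper functions are hypothetical, made up for illustration; the only thing taken from the advice above is that the part can be read off the document URI that comes back with each search result.

```python
# Hypothetical directory prefixes; the thread does not name the real ones.
PART_DIRECTORIES = {"A": "/part-a/", "B": "/part-b/"}


def part_for_uri(uri):
    """Return the part label whose directory prefix matches the URI, else None."""
    for label, prefix in PART_DIRECTORIES.items():
        if uri.startswith(prefix):
            return label
    return None


def selected_directories(checked):
    """Map the checked UI boxes (e.g. {"A"}) to the directory prefixes to search."""
    return [PART_DIRECTORIES[label] for label in sorted(checked)
            if label in PART_DIRECTORIES]


if __name__ == "__main__":
    # A result URI under the hypothetical part A directory.
    print(part_for_uri("/part-a/TR1078523.xml"))
    # Both boxes checked: search both directories.
    print(selected_directories({"A", "B"}))
```

Because each URI starts with exactly one prefix, a document is labelled A or B but never both, which is the exclusivity that makes directories convenient here. The list from `selected_directories` is what you would hand to the directory constraint of the search.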

You could do it with collections but it feels less natural to me and would probably be a bit slower.

For facets simply add a cts:directory-query with the appropriate prefix.

There's no built-in word count feature. I would try to handle that at ingestion time by enriching the XML with a word-count attribute on every description element, plus a range index on description/@word-count as integer. That way you can easily query for ranges of word-counts, or sort by word count. You could implement the ingestion processing using cts:tokenize in an mlcp transform function: http://docs.marklogic.com/guide/ingestion/content-pump#id_82518

-- Mike

> On Feb 13, 2015, at 23:20, Maisnam Ns <[email protected]> wrote:
>
> Hi,
>
> Please find below the sample xml. I am sorry for the long question.
>
> The file name is TR1078523.xml and the structure is below :
>
> <Purchase>
> <transaction-id>TR1078523-6f568ef97904 </transaction-id>
> <transaction-type></transaction-type>
>
>
> <product>
> <title>Frame T-shirt</title>
>
> <description>T-shirt with round neck with a rectangle inside a
> rectangle inside a rectangle etc</description>
> <product-id>PR1078523</product-id>
> <product-item_group_id>tee21</product-item_group_id>
> <product-condition>new</product-condition>
> <product-availability>available for order</product-availability>
> <product-price>20.00 AUD</product-price>
> <product-brand>Blanc Ts</product-brand>
> <product-size>L</product-size>
> <product-image_link>http://ecommerce.com.ts/tee.png</product-image_link>
>
> <product-country>AUS</product-country>
> <product-service>Standard</product-service>
> <product-price>10.00 AUD</product-price>
> </product-shipping>
>
> </product>
> </Purchase>
>
> The above example is just from one file and there are many such files roughly
> 500 GB in size. All files will have similar structure and some will probably
> have more.
> Probably there are more elements which I am omitting for various reasons.
>
> These files reside in one directory lets assume.
>
> Now my task is to store it into Marklogic.
>
> 1. Now I need to divide into two parts randomly of size part A 250 GB and
> part B 250 GB(roughly) and store it into marklogic and the UI has options to
> select from either A or B or both [checkboxes]
> a) Now if A is checked, on the search page, files containing in Part A
> will only be searched.Imagine if this file falls into Part A , then if the
> user searches for transaction-id with 'TR1078523-6f568ef97904' then I need to
> show this on the search page as there will be a hit. But if I select the B
> checkbox , then it will show nothing since the above is not in part B. And
> if I check both A and B [checkboxes ], I need to show this file and the
> various fields.
>
> How should I store these 500 GB into two parts . Collections or
> Directories.And which tool should I use. I am thinking of using mlcp, but
> what if I want to store these as different collections .
>
> Next question:
>
> 2. Now if there are only two facets that I need to show on the UI page
> 'product-country' and 'product-condition' , then these two elements will
> have range indexes.So, I can get the counts easily for these facets.
>
> Now I want to query the count of product-size which is not part of facets by
> 'L' [Large] in Part A only then what should be the query like.
>
>
>
> Now my understanding of marklogic at this point is that only those elements
> which need to be shown as facets need to have a range index and not others.
> So since product-size is not in facets so I am not creating the range
> indexes. Is my understanding correct?
>
> My solution to get the counts of non-facets elements ,here is , as
> product-size is not in facets but still I would be searching the product-size
> by 'L' or 'M' or 'S' can I create a range index for product-size so that I
> can get the counts easily. Or is it still possible to include product-size as
> facets but while showing on the UI , I will show only the product-country and
> product-condition and when the user queries for product-size , I will still
> query for facets just to get the product-condition counts. [ May be cheating
> here to get the counts] or is there a way to get the counts of an element e.g
> product-size having 'L' as value. [ How should be the query look like ]
>
> 3. My question is how to let Marklogic know that I want the <description>
> element 's values which contain bunch of words to be made available for word
> counting . Is it possible in Marklogic.
>
>
> The user will search for a string say 'Large' , marklogic will give us say
> 10000 documents, and then from these 10000 documents I need to get the
> <description> element's values containing those words and need to do a word
> count.
>
> e.g In <description> lets say 'neck' is appearing the most with 2000 counts
> likewise 'inside' with 100 and so on , I need to show
> 1. neck (2000)
> 2. 'inside(100) and so on.
>
> It's long but some one can put me in the right direction.
>
> Thanks
>
>> On Sat, Feb 14, 2015 at 1:20 AM, Michael Blakeley <[email protected]> wrote:
>> Show us some XML. It's difficult to decipher what you mean without concrete
>> examples.
>>
>> Don't rule out anything at this point. You may need a new range index. You
>> may have to use XQuery.
>>
>> -- Mike
>>
>> > On 13 Feb 2015, at 10:50 , Maisnam Ns <[email protected]> wrote:
>> >
>> > Hi,
>> >
>> > Can someone help with this use case.
>> >
>> > I have a huge xml data in which product is one of the elements. I want to
>> > find the top 10 products from these data.
>> >
>> > Product is not in the range index and will not be part of facets.
>> > How to search this with JAVA API and not with xquery.
>> >
>> > Secondly, I need to divide the data into two parts. In marklogic there are
>> > directories and collections.
>> >
>> > But how do I search a string from say part A if data is divided into part
>> > A and part B.There is an option to select just from part A , part B and
>> > both Part A and Part B. Depending on selection of options, if I select
>> > Part A , the string has to search from Part A likewise for Part B and if
>> > both A and B is selected it has to search from both A and B.
>> >
>> > Please let me know how to do this in java. A snippet of code will be
>> > highly appreciated.
>> >
>> > And , information studio of Marklogic does not provide any option for
>> > collections , it only provides for different directories.
>> >
>> > Thanks in advance
>> > _______________________________________________
>> > General mailing list
>> > [email protected]
>> > http://developer.marklogic.com/mailman/listinfo/general
