Thanks Michael for your suggestions.

On Mon, Feb 16, 2015 at 4:04 AM, Michael Blakeley <[email protected]> wrote:
> I would use directories, because it makes the A vs B display easier and
> cheaper to implement. All you have to do is look at the document URI, which
> is automatically returned in search:search results. Also directories are
> naturally exclusive: a doc can only be in A or B, not both.
>
> You could do it with collections but it feels less natural to me and would
> probably be a bit slower.
>
> For facets simply add a cts:directory-query with the appropriate prefix.
>
> There's no built-in word count feature. I would try to handle that at
> ingestion time by enriching the XML with a word-count attribute on every
> description element, plus a range index on description/@word-count as
> integer. That way you can easily query for ranges of word-counts, or sort
> by word count. You could implement the ingestion processing using
> cts:tokenize in an mlcp transform function:
> http://docs.marklogic.com/guide/ingestion/content-pump#id_82518
>
> -- Mike
>
> On Feb 13, 2015, at 23:20, Maisnam Ns <[email protected]> wrote:
>
> Hi,
>
> Please find below the sample xml. I am sorry for the long question.
>
> The file name is TR1078523.xml and the structure is below:
>
> <Purchase>
>   <transaction-id>TR1078523-6f568ef97904</transaction-id>
>   <transaction-type></transaction-type>
>
>   <product>
>     <title>Frame T-shirt</title>
>     <description>T-shirt with round neck with a rectangle inside a rectangle inside a rectangle etc</description>
>     <product-id>PR1078523</product-id>
>     <product-item_group_id>tee21</product-item_group_id>
>     <product-condition>new</product-condition>
>     <product-availability>available for order</product-availability>
>     <product-price>20.00 AUD</product-price>
>     <product-brand>Blanc Ts</product-brand>
>     <product-size>L</product-size>
>     <product-image_link>http://ecommerce.com.ts/tee.png</product-image_link>
>     <product-shipping>
>       <product-country>AUS</product-country>
>       <product-service>Standard</product-service>
>       <product-price>10.00 AUD</product-price>
>     </product-shipping>
>   </product>
> </Purchase>
>
> The above example is just from one file, and there are many such files,
> roughly 500 GB in size. All files will have a similar structure and some
> will probably have more. Probably there are more elements which I am
> omitting for various reasons.
>
> These files reside in one directory, let's assume.
>
> Now my task is to store it into MarkLogic.
>
> 1. Now I need to divide it randomly into two parts, part A of 250 GB and
> part B of 250 GB (roughly), and store them in MarkLogic. The UI has options
> to select from either A or B or both [checkboxes].
> a) Now if A is checked, on the search page, only files contained in Part A
> will be searched. Imagine this file falls into Part A: if the user
> searches for transaction-id with 'TR1078523-6f568ef97904' then I need
> to show this on the search page as there will be a hit. But if I select the
> B checkbox, then it will show nothing since the above is not in part B.
> And if I check both A and B [checkboxes], I need to show this file and the
> various fields.
>
> How should I store these 500 GB in two parts: collections or
> directories? And which tool should I use? I am thinking of using mlcp, but
> what if I want to store these as different collections?
>
> Next question:
>
> 2. Now if there are only two facets that I need to show on the UI page,
> 'product-country' and 'product-condition', then these two elements will
> have range indexes. So I can get the counts easily for these facets.
>
> Now I want to query the count of product-size, which is not part of the
> facets, by 'L' [Large] in Part A only. What should the query look like?
>
> My understanding of MarkLogic at this point is that only those
> elements which need to be shown as facets need to have a range index, and
> not others. Since product-size is not in the facets, I am not creating a
> range index for it. Is my understanding correct?
>
> My solution to get the counts of non-facet elements is this: as
> product-size is not in the facets but I would still be searching
> product-size by 'L' or 'M' or 'S', can I create a range index for
> product-size so that I can get the counts easily? Or is it still possible
> to include product-size as a facet, but while showing on the UI I will show
> only product-country and product-condition, and when the user queries
> for product-size, I will still query for facets just to get the
> product-condition counts? [Maybe cheating here to get the counts.] Or is
> there a way to get the counts of an element, e.g. product-size having 'L'
> as value? [What should the query look like?]
>
> 3. My question is how to let MarkLogic know that I want the <description>
> element's values, which contain a bunch of words, to be made available for
> word counting. Is it possible in MarkLogic?
>
> The user will search for a string, say 'Large'; MarkLogic will give us, say,
> 10000 documents, and then from these 10000 documents I need to get the
> <description> element's values containing those words and need to do a word
> count.
>
> E.g. in <description> let's say 'neck' is appearing the most with 2000
> counts, likewise 'inside' with 100 and so on; I need to show
> 1. neck (2000)
> 2. inside (100) and so on.
>
> It's long, but I hope someone can put me in the right direction.
>
> Thanks
>
> On Sat, Feb 14, 2015 at 1:20 AM, Michael Blakeley <[email protected]>
> wrote:
>
>> Show us some XML. It's difficult to decipher what you mean without
>> concrete examples.
>>
>> Don't rule out anything at this point. You may need a new range index.
>> You may have to use XQuery.
>>
>> -- Mike
>>
>> > On 13 Feb 2015, at 10:50 , Maisnam Ns <[email protected]> wrote:
>> >
>> > Hi,
>> >
>> > Can someone help with this use case?
>> >
>> > I have a huge amount of XML data in which product is one of the
>> > elements. I want to find the top 10 products from this data.
>> >
>> > Product is not in the range index and will not be part of facets.
>> > How do I search this with the Java API and not with XQuery?
>> >
>> > Secondly, I need to divide the data into two parts. In MarkLogic there
>> > are directories and collections.
>> >
>> > But how do I search a string from, say, part A if the data is divided
>> > into part A and part B? There is an option to select just part A, part B,
>> > or both part A and part B. Depending on the selection, if I select
>> > part A, the string has to be searched in part A, likewise for part B,
>> > and if both A and B are selected it has to be searched in both A and B.
>> >
>> > Please let me know how to do this in Java. A snippet of code will be
>> > highly appreciated.
>> >
>> > And Information Studio of MarkLogic does not provide any option for
>> > collections; it only provides for different directories.
>> >
>> > Thanks in advance
>> > _______________________________________________
>> > General mailing list
>> > [email protected]
>> > http://developer.marklogic.com/mailman/listinfo/general
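
For the word counts discussed in the quoted thread (the per-description word-count attribute Mike suggests, and the "neck (2000), inside (100)" style list from question 3), here is a minimal sketch in plain Java. It makes no MarkLogic calls at all: the class name DescriptionWordCounts and its methods are hypothetical, and the regex split is only a rough stand-in for cts:tokenize, so the counts may differ from what the server-side tokenizer would give. It assumes the <description> strings have already been pulled out of the documents or search results.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of any MarkLogic API.
final class DescriptionWordCounts {

    // Stand-in tokenizer: lower-cases the text and splits on anything that is
    // not a letter or digit. This only approximates cts:tokenize.
    static List<String> words(String description) {
        List<String> out = new ArrayList<>();
        for (String token : description.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                out.add(token);
            }
        }
        return out;
    }

    // Number of words in one description: the value that could be written to
    // a word-count attribute at ingestion time.
    static int wordCount(String description) {
        return words(description).size();
    }

    // The n most frequent words across all descriptions, highest count first;
    // ties are broken alphabetically so the order is deterministic.
    static List<Map.Entry<String, Integer>> topWords(List<String> descriptions, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (String description : descriptions) {
            for (String word : words(description)) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }

    public static void main(String[] args) {
        // Description text taken from the sample document in this thread.
        List<String> descriptions = List.of(
            "T-shirt with round neck with a rectangle inside a rectangle inside a rectangle etc");
        System.out.println("word-count = " + wordCount(descriptions.get(0)));
        for (Map.Entry<String, Integer> e : topWords(descriptions, 3)) {
            System.out.println(e.getKey() + " (" + e.getValue() + ")");
        }
    }
}
```

Counting on the client like this means fetching every matching description, which is unlikely to scale to 10000 hits per search on 500 GB of data, so it is meant only to pin down the expected output, not as the way to do it in production.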
