Hi Danny,

OK great. Yes, that answers my question very well, thank you. I was asking more about the theory, but you took the extra step to try out a real-world scenario, which is very helpful. I had actually asked a couple of these questions in Dec and didn't receive a response, so I was beginning to lose faith that this board was going to be helpful, but with feedback like yours, it definitely is.
I find it interesting that the raw type of the stored data isn't that important when doing the optimization (i.e. an index). I will be reading up on the performance tuning capabilities once all my core work is done, and I expect that will provide a lot of insight, but at this point I needed to know a bit of the theory behind data storage and optimization so I can start storing data in the proper format and not worry about having to convert it all at a later date due to optimization requirements. For now I'm just going to store data in string format and convert it when doing reporting. I previously wasn't sure that was going to be a good approach, but it certainly seems to work (with proper indexing). Quite a bit different from how I would be thinking with regard to an RDBMS.

Thanks again!

Mark

On Jan 17, 2008 8:32 PM, Danny Sokolsky <[EMAIL PROTECTED]> wrote:
> Hi Mark,
>
> It is true that it would take extra time to cast one or two million
> times in a query. But it will take time to do anything that many times
> in a query. The trick is to write the query in such a way that it
> does this fast. Range indexes are a good tool for this, in combination
> with the order by optimizations. For example, if you want to find the
> 10 latest dates from an element named stringdate, such as:
>
> <stringdate>2008-12-02</stringdate>
>
> then you can write a query like the following:
>
> (for $x in //stringdate order by xs:date($x) descending return $x)[1 to 10]
>
> Without a range index, it will need to find all of the stringdates and
> cast them all to dates in the order by clause. For a ballpark estimate,
> on my laptop with 1,000,000 stringdate elements, this takes about 13
> seconds. Not bad considering it has to order 1 million items.
>
> Now if I add a date range index for this element, the same query takes
> about 0.3 seconds, for a speedup of about 40x.
> That is because the range index optimized the sort in the order by
> clause, and we just returned the first 10 of them. For details about
> the order by optimizations, see the Query Performance and Tuning book
> (http://developer.marklogic.com/pubs/3.2/books/performance.pdf).
>
> Another useful tool is the profile button in cq. It shows you where
> your query is spending time processing.
>
> My recommendation is to try some tests with range indexes and order by
> optimizations and see how it works. It is quite easy to generate some
> dummy data for these tests.
>
> I'm not 100% sure I answered your question, but hopefully it will lead
> you in the direction of what you are trying to accomplish.
>
> -Danny
>
> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Mark Waschkowski
> Sent: Thursday, January 17, 2008 11:38 AM
> To: General Mark Logic Developer Discussion
> Subject: Re: [MarkLogic Dev General] Type safe data and referencing questions
>
> OK great, thanks for the information Danny.
>
> I'm a bit concerned about the type safety issue (#1) not because I'm
> worried about the data being stored correctly, but because a
> conversion might have to be carried out many, many times during an
> evaluation. I may be repeating the question here, but do you have any
> idea of how the above use case would work with 1M+ rows of data? Seems
> to me that converting some date text 2M+ times (twice per record in
> this case) would have an adverse effect on a query, no? Likewise
> converting when wanting to order a larger data set by date?
>
> Really appreciate the feedback.
>
> Mark
>
> On Jan 14, 2008 8:12 PM, Danny Sokolsky <[EMAIL PROTECTED]> wrote:
> > Hi Mark,
> >
> > I will take a stab at your questions.
> >
> > 1) You do not need a schema to use typed data.
> > A schema will make it so Mark Logic treats an element or attribute as
> > its defined type without an explicit cast, but you can always add an
> > explicit cast (like the use-case example) to make sure XQuery treats a
> > value as a certain type (with or without a schema). The schema just
> > makes that a little easier. There might be some performance advantage
> > to using a schema, but I don't think it will be that big. It is worth
> > trying though, as this might depend somewhat on your content. The real
> > performance advantage will come from creating range indexes on
> > elements or attributes you will use in comparisons. Schemas can also
> > help you ensure that your data is in the correct format when you load
> > it, as Mark Logic will throw an exception if it cannot cast content in
> > an element or attribute to the type specified in the schema.
> >
> > 2) You could put the referencing information in the properties
> > document. The default conversion application in CPF does this, for
> > example, to keep track of the original documents and various converted
> > documents.
> >
> > 3) There are no foreign key constraints built in. I think any best
> > practices would depend on what you are trying to do. Two approaches
> > that tend to work well are to a) put the constraining items in the
> > same document and/or b) use the properties document corresponding to a
> > document to store information about what is in the document.
> >
> > -Danny
> >
> > -----Original Message-----
> > From: [EMAIL PROTECTED]
> > [mailto:[EMAIL PROTECTED] On Behalf Of Mark Waschkowski
> > Sent: Monday, January 14, 2008 1:25 PM
> > To: [email protected]
> > Subject: [MarkLogic Dev General] Type safe data and referencing questions
> >
> > Hi,
> >
> > Have been using Marklogic for a while now and haven't seen answers to
> > the below questions yet, anyone know of an answer or two?
> >
> > 1) Type safe data - I'm concerned with retrieval of typed data,
> > especially for date information. The only way to store typed data is
> > through the use of a schema right? I can't specify the type of data on
> > a per element basis, correct?
> > ie. <person> <birthday xs:date>01-01-1970</birthday></person>
> >
> > As well, I noticed the below query in the use case examples:
> >
> > let $item := doc("items.xml")//item_tuple
> >   [end_date >= xs:date("1999-03-01")
> >    and
> >    end_date <= xs:date("1999-03-31")]
> > return
> >   <item_count>
> >     {
> >       count($item)
> >     }
> >   </item_count>
> >
> > Is there a schema behind the loaded data or are the examples un-type
> > safe? Should I just not worry about type safety and convert the data
> > values to the type I need when querying? If so, won't that be a
> > performance issue?
> >
> > 2) Referencing - what is the (if there is one) best practice approach
> > to reference documents together?
> > ie. Document A and Document B should both refer to Document C
> >
> > 3) Foreign key constraints - is this supported at all in some fashion?
> > If not, any approaches to suggest?
> >
> > Thanks in advance for any and all suggestions!
> >
> > Mark
> > _______________________________________________
> > General mailing list
> > [email protected]
> > http://xqzone.com/mailman/listinfo/general
