[google-appengine] Re: Expando and Index partitioning

Eli Tue, 03 Nov 2009 18:15:03 -0800

yes, I guess the inequality borks it.. I would have to pre-compute day
ranges..


This specific technique would probably be most useful for collecting
hit stats.. where each individual hit was inserted into the Expando
table.

So along with these columns (for june 2009):

  Number,June,y2009

You would also have (maybe):
  Yahoo,Hour

And it would look like this for an entry from the 15th:

  Number, June, y2009, Hour, Yahoo
  389,    15,   6,     1400, 1
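
As a rough sketch of how one hit could be turned into that row, the helper below maps a timestamp and a referrer name to the column names and values described above. The names hit_properties and MONTH_NAMES, and the choice to store the hour as hour * 100 (so 2 pm becomes 1400), are assumptions made for illustration; the thread only shows the resulting row.

```python
import datetime

# English month names, spelled out so the result does not depend on locale.
MONTH_NAMES = ("January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November",
               "December")


def hit_properties(number, when, source):
    """Return the dynamic property names and values for one hit.

    Hypothetical helper: the year becomes a property named 'y<year>'
    holding the month number, and the month name becomes a property
    holding the day of the month.
    """
    return {
        "Number": number,
        "y%d" % when.year: when.month,
        MONTH_NAMES[when.month - 1]: when.day,
        "Hour": when.hour * 100,  # assumed encoding: 14:00 -> 1400
        source: 1,                # one column per referrer, e.g. Yahoo
    }


row = hit_properties(389, datetime.datetime(2009, 6, 15, 14, 0), "Yahoo")
```

The resulting dict has one key per column in the example row, so it can be inspected or logged before anything is written to the datastore.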

so.. to get the hit count from Yahoo on June 15th 2009 you would do:

Select * from meStats Where y2009 = 6 AND June = 15 AND Yahoo = 1
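
The same naming convention can generate the query text. This is only a sketch: day_query is a made-up helper, it takes the month name as an argument instead of deriving it, and all it does is format a GQL string that uses equality filters only, like the one above.

```python
def day_query(kind, year, month_num, month_name, day, source):
    """Build a GQL string with equality filters only (hypothetical helper)."""
    return ("SELECT * FROM %s WHERE y%d = %d AND %s = %d AND %s = 1"
            % (kind, year, month_num, month_name, day, source))


query = day_query("meStats", 2009, 6, "June", 15, "Yahoo")
```

Counting the entities returned for that string would give the hit count for the day.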

Now, I could go all nutty and precompute date ranges to insert along
with the entries as well... but it might not be too much work to just
grab all the June days and pull the ones you wanted.
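
Pulling the wanted days out in application code could look roughly like this. The list of dicts is made-up sample data standing in for entities already fetched with an equality filter on y2009 = 6, and days_in_range is an illustrative name, not part of any API.

```python
def days_in_range(hits, month_name, first_day, last_day):
    """Keep hits whose day-of-month property lies in [first_day, last_day]."""
    return [hit for hit in hits
            if first_day <= hit[month_name] <= last_day]


# Made-up stand-in for the June 2009 rows returned by the datastore.
june_hits = [
    {"Number": 200, "y2009": 6, "June": 14},
    {"Number": 389, "y2009": 6, "June": 15},
    {"Number": 412, "y2009": 6, "June": 21},
]

# Everything after the 15th, filtered in memory instead of with "June > 15".
late_june = days_in_range(june_hits, "June", 16, 30)
```

The trade-off is that the whole month is fetched even when only a few days are needed.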

(This is just the first usage example that comes to mind.  This column
naming method could be used for all sorts of set intersection stuff,
and should cut down on insert times, since it ought to partition out
the indexes when dealing with humongous datasets).

I already figured out the nutty way I'd use exec to run the variable
put method (tested and works on my dev and live appengine).. I just
hacked this out real fast to see if I could get it to work..
num,monthStr,dayNum etc... are all variables fed into the function
that builds and execs this meStr string (I didn't incorporate the
Yahoo and Hour parts):


meStr = ("meEntity = meStats(meNumber = " + str(num) + ", " + monthStr
         + " = " + str(dayNum) + ", " + yearStr + " = " + str(monthNum)
         + ")\nmeEntity.put()\n")
exec meStr
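
One alternative to building source text and exec'ing it is to pass the computed names as keyword arguments with **. The sketch below uses a small stand-in class so it runs outside App Engine; FakeStats and build_entity are hypothetical names, and the assumption is that the real meStats Expando accepts arbitrary keyword arguments in its constructor in the same way.

```python
class FakeStats(object):
    """Stand-in for the meStats Expando model; it only stores attributes."""

    def __init__(self, **properties):
        for name, value in properties.items():
            setattr(self, name, value)


def build_entity(model, num, month_str, day_num, year_str, month_num):
    """Create an entity whose property names are computed at run time."""
    properties = {
        "meNumber": num,
        month_str: day_num,   # e.g. June = 15
        year_str: month_num,  # e.g. y2009 = 6
    }
    return model(**properties)


entity = build_entity(FakeStats, 389, "June", 15, "y2009", 6)
```

With the real model, calling put() on the returned entity would store it, and the values never pass through string formatting.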


Anyway, this was just a side thought I had while wondering what the
point of Expando was.. since it's so unstructured.. I couldn't imagine
why someone would want such an undependable datasource.. but, I can
see this method as being highly useful in a number of cases.. (again,
the main ultimate benefit I see is reducing insert times).

On Nov 3, 5:57 pm, Tim Hoffman <[email protected]> wrote:
> Hi Eli
>
> That's true, there are many cases where you don't need composite
> indexes, as per the documentation I provided a link to.
> However the specific example you gave does.  (Now maybe you don't
> actually plan to use the specific example you provided
> but I don't have anything else to go on)
>
> Now I did actually try it myself before posting, and the specific
> indexes I mentioned get created in index.yaml. And if you run the
> dev_server with  --require_indexes
> and the indexes in question are not present
>
> You get
>
> NeedIndexError: This query requires a composite index that is not
> defined. You must update the index.yaml file in your application root.
> This query needs this index:
> meStats
>   properties:
>   - name: y2009
>   - name: June
>
> Rgds
>
> T
>
> On Nov 4, 5:37 am, Eli <[email protected]> wrote:
>
>
>
> > I suggest you watch the IO talk where Brett Slatkin discusses Merge
> > Joins and pre-computing ranges.
>
> > http://www.youtube.com/watch?v=AgaL6NGpkB8
>
> > Watch the last half (past 34 min).. and maybe pay attention to the
> > section that's just after (41 minutes).
>
> > This implies you do not need composite indexes (or to create any new
> > indexes beyond the default ones) for all sorts of queries if you
> > construct your data in the right way.
>
> > I will test this out tonight to provide a proof of concept.
>
> > On Nov 3, 10:12 am, Tim Hoffman <[email protected]> wrote:
>
> > > Hi
>
> > > On Nov 3, 10:26 pm, Eli Jones <[email protected]> wrote:
>
> > > > I haven't done any testing on this yet since I'd have to fill up tens
> > > > of gigs of information to see real live performance numbers.
>
> > > > I'm hoping the implicit partitioning makes it so that one doesn't need
> > > > > manually created indexes (just the default ones.)
>
> > > > > The example I showed would be a schema for storing a daily
> > > > > int statistic.
>
> > > > The 'June' column entries would show the day of that month and the
> > > > 'y2009' column would have the 6 value since June is the 6th month of
> > > > the year.
>
> > > > If I wanted stats for June, my select would look like this:
>
> > > > Select * From meStats Where y2009 = 6 AND June > 15
>
> > > But the minute you do this ">" you will then need an index that looks
> > > like
>
> > > - kind: meStats
> > >   properties:
> > >   - name: y2009
> > >   - name: June
>
> > > and so on for every year month combination where you do a >
> > > comparison.
>
> > > I think you should have a read about how indexes are created and
> > > accessed before you try optimising something that probably doesn't
> > > need it.
>
> > > > Note the rules from the doc on defining indexes:
> > > > http://code.google.com/appengine/docs/python/datastore/queriesandinde...
>
> > > Other forms of queries require their indexes to be specified in
> > > index.yaml, including:
>
> > >     * queries with multiple sort orders
> > >     * queries with a sort order on keys in descending order
> > >     * queries with one or more inequality filters on a property and
> > > one or more equality filters over other properties
> > >     * queries with inequality filters and ancestor filters
>
> > > > You fall into the third rule. Which as I said earlier will mean you
> > > need to manually specify in index.yaml a massive number of indexes
>
> > > Rgds
>
> > > T
>
> > > > This would/should implicitly hit the june rows for 2009 and get the
> > > > stats for every day after the 15th.
>
> > > > You could munge around your column names and the values inserted to
> > > > get different data reporting behaviour..
>
> > > > The main, potential value is the implicit partitioning (where you
> > > > don't need to manually define a bunch of schemas up front).
>
> > > > On 11/3/09, Tim Hoffman <[email protected]> wrote:
>
> > > > > Hi
>
> > > > > Have you tried this?
>
> > > > > For starters you can't assign values to numbers.
>
> > > > > ie no matter what you do you can't assign 2009 = 'abc'
>
> > > > > You would need to use some other identifier as you mentioned and then
> > > > > specify something like
> > > > > > year_2009 = db.IntegerProperty(name='2009') or something similar.
>
> > > > > I also see a problem with this strategy with regard to index
> > > > > definitions.
> > > > > Whilst running the SDK the indexes will get created as you define data
> > > > > however once you are running
> > > > > in real google environment you will need to make sure you have already
> > > > > defined all possible indexes that you
> > > > > plan to use before you create any new data (or reindex everything),
> > > > > which means indexes for all years you plan to hold data for and
> > > > > search,
> > > > > and months, and combinations of the two.
>
> > > > > I am not sure this is a particularly good approach, but then I am not
> > > > > sure I get what you are actually doing.
>
> > > > > Have you compared the performance of lookups between the two
> > > > > strategies, also remembering if you are actually interested in year/
> > > > > month then you are
> > > > > actually using composite indexes,  I wonder if you will ever use the
> > > > > month only index (apart from comparing months with months for all
> > > > > years in no particular order)
>
> > > > > Rgds
>
> > > > > T
>
> > > > > On Nov 3, 12:22 am, Eli <[email protected]> wrote:
> > > > >> Here's something I've been wondering about Expando.
>
> > > > >> Say you define an Expando model like so:
>
> > > > >> class meStats(db.Expando):
> > > > >>     meNumber = db.IntegerProperty(required=True)
>
> > > > >> And, then you begin populating it like so:
>
> > > > >> meEntity1 = meStats(meNumber = 200,
> > > > >>                                 June          = 14,
> > > > >>                                 2009          = 6)
>
> > > > > >> meEntity1.put()
>
> > > > >> meEntity2 = meStats(meNumber = 381,
> > > > >>                                 July           = 21,
> > > > >>                                 2009          = 7)
>
> > > > >> meEntity2.put()
>
> > > > >> ..and so on.
>
> > > > >> The "July" column only has indexes for entities that have "July"
> > > > >> defined.. correct?  So, in effect, I am creating a partitioned index
> > > > >> for a table that can grow indefinitely.. and each time I get to a new
> > > > >> year/month combo, I am inserting into new indexes..? (instead of
> > > > >> inserting into an ever increasing, monolithic "Month" column index..)
>
> > > > >> Mainly, I'm packing the pertinent information into the column names
> > > > >> and column values (instead of making the column name just some dummy
> > > > >> value like "Month").. this allows me to implicitly create the
> > > > > >> partitioned table/index (I think of it as a partitioned index since
> > > > > >> it is, schematically [as far as I'm concerned], one table.)
>
> > > > >> You could give the columns better names.. maybe "June_Day" and maybe
> > > > >> "2009_Month" if you wanted...
>
> > > > >> Does this make sense?  Have I misunderstood how Expando handles
> > > > >> indexes?
>
> > > > >> Another way to word this question would be:
>
> > > > >> Is there a difference between the indexes created for the June and
> > > > >> July entries in the above Expando model and the below Model models:
>
> > > > >> class meJune09Stats(db.Model):
> > > > >>     meNumber = db.IntegerProperty(required=True)
> > > > >>     June = db.IntegerProperty(required=True)
> > > > >>     2009 = db.IntegerProperty(required=True)
>
> > > > >> class meJuly09Stats(db.Model):
> > > > >>     meNumber = db.IntegerProperty(required=True)
> > > > >>     July = db.IntegerProperty(required=True)
> > > > >>     2009 = db.IntegerProperty(required=True)
>
> > > > >> Thanks for any information.
>
> > > > --
> > > > Sent from my mobile device