3 Newbie questions

Will Schenk Thu, 25 Sep 2008 06:25:38 -0700

Hey all

I went to jchris' talk at Columbia a week or so ago, and have been playing around with CouchDB ever since. I'm not sure that I like all of the high-level design decisions (I for one like types, and I think I'm running into this below), but I wanted to actually use it because there certainly are some neat things. I have a couple of newbie usage questions.

The scenario is that I'm building a spider-type thing that processes a "remote resource" into a specific localized "document". So I'm going to be pulling in, say, 3 pages and from those producing a document that describes the restaurant, has its menu and reviews, and knows its lat and long, and then serving the documents up on a map. I'm following the architecture that I described in http://benchcoach.com/papers/scraping and am basically reimplementing menumaps as a proof of concept with CouchDB.

I'm using Merb and RelaxDB at the moment, but I think I may need to get a little lower level.


Question 1:  Where do I store the original documents?

Right now I have a "RemoteUrl" document which contains the last-modified, etag, encoding, and the content itself. (It's very important in the design that I keep the original content and all previous versions around.) For some reason I can't store the content directly; I need to base64 encode it, which seems like a problem with the Ruby JSON library. But CouchDB is slow when it has all of these 200K documents sitting around in it. Is this not the right sort of usage? I've created a map/reduce view for the "latest revision" like this:


function(doc) {
  // Only index cached pages that actually have content.
  if (doc.class == "RemoteUrl" && doc.content) {
    emit(doc.normalized_url, doc);
  }
}

function(key, values, rereduce) {
  // On rereduce, values are the docs returned by earlier reduce calls,
  // so the same "pick the newest" scan applies in both cases.
  var doc = values[0];
  for (var i = 1; i < values.length; i++) {
    if (values[i].created_at > doc.created_at) {
      doc = values[i];
    }
  }
  return doc;
}

but it takes a long time to run and see results. This is probably because it actually needs to load up all the data, send it over a pipe to the JavaScript process, have it do the work, send it back, and repeat the process with the reduce step. It's pretty slow. Is there a better way to do this? My guess is that if I make this into two queries, and don't emit doc.content (the actual content of the file), it would be a lot faster, but that seems pretty ugly. I.e. just map [doc._id, doc.created_at] and then make another trip to pull the document back by id. But you still have the problem of view creation taking forever. Does anyone have any suggestions?
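A minimal sketch of the two-query idea described above: the view carries only the id and timestamp, and the full document is fetched by id in a second request. The names `mapLatest`, `reduceLatest`, and the `emit` stub are hypothetical (CouchDB provides `emit` itself), and it assumes `created_at` values compare correctly with `>`.

```javascript
// Stand-in for CouchDB's emit(), only so the sketch runs outside the server.
var rows = [];
function emit(key, value) { rows.push({ key: key, value: value }); }

// Map: emit small metadata only, never doc.content.
function mapLatest(doc) {
  if (doc.class == "RemoteUrl" && doc.content) {
    emit(doc.normalized_url, { id: doc._id, created_at: doc.created_at });
  }
}

// Reduce: keep the newest entry. Inputs and outputs have the same shape,
// so the same scan also works when CouchDB calls this with rereduce = true.
function reduceLatest(keys, values, rereduce) {
  var latest = values[0];
  for (var i = 1; i < values.length; i++) {
    if (values[i].created_at > latest.created_at) {
      latest = values[i];
    }
  }
  return latest;
}
```

Queried with grouping by key, this would return one small `{ id, created_at }` value per normalized URL. The view build still has to read every document once, so it only addresses the cost of shipping content through the reduce step and the query results.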

Question 2: I'm using RelaxDB right now, and it only really wants to work with one database per environment. Seeing how slow CouchDB is at processing these documents, I was thinking that I'd want to keep the bulk-data stuff in its own database so that the other views won't need to process the whole data set. The views really work by document "type", so there's no need to pump the huge amount of data from the Erlang process to the JavaScript process when all it's doing is seeing that doc.class != "RemoteUrl". (Which is why I'd want to have types, but no matter, I guess we can hack them on the side like this!) I'm guessing that's why it's falling down. So if I could split out the "web-cache" database from the "parsed" database I think it would be a little faster. I'm wondering what people think about this sort of design decision, and how they would suggest implementing it.
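One way the split could look at the HTTP level, sketched under assumptions: documents are routed by `doc.class` to either the "web-cache" or the "parsed" database, and the base URL assumes CouchDB on its default port on localhost. `databaseFor` and `docUrl` are hypothetical helpers, not part of RelaxDB or CouchDB.

```javascript
// Assumption: CouchDB listening on its default port on localhost.
var BASE = "http://localhost:5984";

// Raw fetched pages go to "web-cache"; everything else goes to "parsed".
function databaseFor(doc) {
  return doc.class == "RemoteUrl" ? "web-cache" : "parsed";
}

// URL a document would be PUT to or fetched from.
function docUrl(doc) {
  return BASE + "/" + databaseFor(doc) + "/" + encodeURIComponent(doc._id);
}
```

With this layout, views defined in "parsed" would never be handed the large cached pages at all.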

Question 3: Say I eventually get something which has 2 floating point attributes (e.g. lat and long). How would I get all of the documents where those were in a square? I.e.

  select * from places
  where places.lat >= left_lat and places.lat <= right_lat
    and places.longitude >= top_longitude and places.longitude <= bottom_longitude;

I can see how you'd do this with one dimension, but I'm not sure how you'd do it with the second. Especially since you need to make all these data round trips...
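For illustration, one possible approach (an assumption, not something confirmed in this thread) is to let the view handle one dimension and filter the other in the client: key the view on latitude, query it with `startkey` and `endkey`, then discard rows whose longitude falls outside the box. The `emit` stub and the `mapByLat`, `queryLatRange`, and `inBox` names are hypothetical, and `queryLatRange` only simulates the range query locally.

```javascript
// Stand-in for CouchDB's emit(), only so the sketch runs outside the server.
var latRows = [];
function emit(key, value) { latRows.push({ key: key, value: value }); }

// Map: key on latitude, carry the longitude along as a small value.
function mapByLat(doc) {
  if (typeof doc.lat == "number" && typeof doc.longitude == "number") {
    emit(doc.lat, { id: doc._id, longitude: doc.longitude });
  }
}

// Simulates querying the view with ?startkey=minLat&endkey=maxLat.
function queryLatRange(rows, minLat, maxLat) {
  return rows.filter(function (r) {
    return r.key >= minLat && r.key <= maxLat;
  });
}

// Client-side step: filter the latitude band down to the longitude range.
function inBox(rows, minLat, maxLat, minLong, maxLong) {
  return queryLatRange(rows, minLat, maxLat)
    .filter(function (r) {
      return r.value.longitude >= minLong && r.value.longitude <= maxLong;
    })
    .map(function (r) { return r.value.id; });
}
```

This needs a single round trip, but the server returns the whole latitude band, so it is only reasonable when that band is small compared to the data set.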

Thanks in advance. I think there are a lot of very interesting ideas in CouchDB. It seems like a lot of this stuff can't be done nearly as well as with a SQL database, but I'm hoping that it's just me being ignorant.


-w
http://sublimeguile.com
