Hi Will,
On Sep 25, 2008, at 9:17 AM, Will Schenk wrote:
Hey all
I went to jchris' talk at Columbia a week or so ago, and have been
playing around with couchdb ever since. I'm not sure that I like
all of the high level design decisions (I for one like types, and I
think I'm running into this below) but I wanted to actually use it
because there certainly are some neat things. I have a couple of
newbie usage questions.
The scenario is that I'm building a spider-type thing that processes
a "remote resource" into a specific localized "document". So I'm
going to be pulling in, say, 3 pages and from those producing a
document that describes a restaurant, has its menu and reviews, and
knows its lat and long, and then serving them up on a map. I'm
following the architecture that I described in
http://benchcoach.com/papers/scraping and am basically
reimplementing menumaps as a proof of concept with couchdb.
I'm using merb and relaxdb at the moment, but I think I may need to
get a little lower level.
Question 1: Where do I store the original documents?
Right now I have a "RemoteUrl" document which contains the last-
modified, etag, encoding, and the content itself. (It's very
important in the design that I keep the original content and all
previous versions around.) For some reason, I can't store the
content directly -- I need to base64 encode it, which seems like a
problem with the ruby json library. But couchdb is slow when it has
all of these 200K documents sitting around in it. Is this not the
right sort of usage? I've created a map/reduce view for the "latest
revision" like this:
function(doc) {
  if( doc.class == "RemoteUrl" && doc.content ) {
    emit(doc.normalized_url, doc);
  }
}
function(key, values, rereduce) {
  // The values are documents on both the first reduce pass and on
  // rereduce, so the same "latest created_at wins" logic applies.
  var doc = values[0];
  var max = values[0].created_at;
  for( var i = 1; i < values.length; i++ ) {
    if( values[i].created_at > max ) {
      doc = values[i];
      max = values[i].created_at;
    }
  }
  return doc;
}
but it takes a long time to build and query. This is probably
because it actually needs to load up all the data, send it over a
pipe to the javascript process, have it do the map, send it back,
and repeat the process with the reduce step. It's pretty slow. Is
there a better way to do this? My guess is that if I made this into
two queries, and didn't emit doc.content (the actual content of the
file), it would be a lot faster, but that seems pretty ugly.
I.e. just map [doc._id, doc.created_at] and then make another trip
to pull the doc back by id. But you still have the problem of view
creation taking forever. Does anyone have any suggestions?
You should be able to get the latest revision without a reduce
(generally a good thing to avoid if you can). Something like
function(doc) {
  if( doc.class == "RemoteUrl" && doc.content ) {
    emit([doc.normalized_url, doc.created_at], doc);
  }
}
will give you all your documents sorted first by URL and then by
revision time. Then you can query the view with some combination of
startkey, count, and maybe descending=true (depending on how your
revision dates sort) to get the latest revision of a particular doc.
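For example (the database, design doc, and view names here are just
placeholders, and the exact view URL and key collation details can
vary between CouchDB versions), the newest revision of one URL could
come back as a single row with something like:

GET /spider/_view/scraper/latest_by_url?startkey=["http://example.com/menu",{}]&descending=true&count=1

The {} sentinel collates after any created_at string, so
descending=true with count=1 hands back just the most recent
revision for that URL; the startkey JSON has to be URL-encoded in a
real request.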
Alternatively, if you wanted to suppress all old revisions in the view
you could add a simpler reduce function which takes advantage of the
map sorting the results for you:
function(keys, values) {
  return values[0]; // or maybe values.pop();
}
By the way, you're using design documents and not _temp_views, right?
View indexing may currently be slow, but with a design doc you only
have to do it once.
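Roughly, a design doc is just another document whose "views" field
holds the function source as strings (the names here are
placeholders):

{
  "_id": "_design/scraper",
  "language": "javascript",
  "views": {
    "latest_by_url": {
      "map": "function(doc) { ... }",
      "reduce": "function(keys, values, rereduce) { ... }"
    }
  }
}

The index gets built the first time the view is queried and is then
updated incrementally for changed documents, instead of being
recomputed from scratch on every request the way a _temp_view is.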
You'll have to test and see whether it's better to emit the doc in the
view code or do a second trip to the DB to retrieve it. Both are
valid, and I think there's a patch in the works to add an
"include_docs" or similar parameter so that you can optionally
retrieve the associated document for any row of any view. Reduce
generally works best with small amounts of data.
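If you do try the two-trip route, the sketch is just your map with a
small value; each view row still carries the emitting doc's id, which
you can fetch afterwards with a plain GET /dbname/<id>:

function(doc) {
  if( doc.class == "RemoteUrl" && doc.content ) {
    // keep the index small: don't put doc.content in the key or value
    emit([doc.normalized_url, doc.created_at], null);
  }
}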
Question 2: I'm using RelaxDB right now, and it only really wants to
work with one database per environment. Seeing how slow couchdb is
at processing these documents, I was thinking that I'd want to keep
the bulk-data stuff in its own database so that the other views
won't need to process the whole data set. The views really work by
document "type", so there's no need to pump the huge amount of data
from the erlang process to the javascript process when all it's
doing is seeing that doc.class != "RemoteUrl". (Which is why I'd
want to have types, but no matter, I guess we can hack them on the
side like this!) I'm guessing that's why it's falling down. So if
I could split the "web-cache" database out from the "parsed"
database I think it would be a little faster. I'm wondering what
people think about this sort of design decision, and how they would
suggest implementing it.
If you don't need to analyze the bulk-data in any view you could
consider storing it as an attachment to a doc. Details are at the
bottom of this page:
http://wiki.apache.org/couchdb/HttpDocumentApi
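The inline form described there looks roughly like this (the doc and
attachment names are made up for illustration); the "data" field is
base64, and since attachment bodies aren't passed to map functions
the bulk content stays out of view processing:

{
  "_id": "remote_url_example",
  "class": "RemoteUrl",
  "normalized_url": "http://example.com/menu",
  "etag": "\"abc123\"",
  "_attachments": {
    "content.html": {
      "content_type": "text/html",
      "data": "PGh0bWw+IC4uLiA8L2h0bWw+"
    }
  }
}

You can then read the raw content back from
/dbname/remote_url_example/content.html without the base64 overhead.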
Question 3: Say I eventually get something which has 2 floating
point attributes (e.g. lat and long). How would I get all of the
documents where those fall inside a square? i.e.
select * from places where places.lat >= left_lat and places.lat <=
right_lat and places.longitude >= top_longitude and
places.longitude <= bottom_longitude;
I can see how you'd do this with one dimension, but I'm not sure
how you'd do it with the second. Especially since you need to make
all these data round trips...
Yeah, I guess that's a bit tricky. If the data volume doesn't get in
your way you could emit [places.lat, places.long] as the key, query
the view with your latitude range as startkey and endkey, and then
pick out documents in your longitude range client-side. Others may
well have more clever suggestions.
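A rough sketch of that map (the "Place" class and the lat/long field
names are just assumptions based on your example):

function(doc) {
  if( doc.class == "Place" && doc.lat != null && doc.long != null ) {
    emit([doc.lat, doc.long], null);
  }
}

You could query it with something like startkey=[left_lat] and
endkey=[right_lat,{}] to get the whole latitude band, then drop rows
whose second key element falls outside your longitude range on the
client.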
Best,
Adam
Thanks in advance. I think there are a lot of very interesting
ideas in Couchdb. It seems like a lot of this stuff can't be done
nearly as well as with a sql database, but I'm hoping that's just
me being ignorant.
-w
http://sublimeguile.com