Hi Will,
On Sep 25, 2008, at 9:17 AM, Will Schenk wrote:
Hey all
I went to jchris' talk at Columbia a week or so ago, and have been
playing around with couchdb ever since. I'm not sure that I like
all of the high level design decisions (I for one like types, and I
think I'm running into this below) but I wanted to actually use it
because there certainly are some neat things. I have a couple of
newbie usage questions.
The scenario is that I'm building a spider-type thing that processes
a "remote resource" into a specific localized "document". So I'm
going to be pulling in, say, 3 pages and from those producing a
document that describes a restaurant, has its menu and reviews, and
knows its lat and long, and then serving them up on a map. I'm
following the architecture that I described in
http://benchcoach.com/papers/scraping and am basically
reimplementing menumaps as a proof of concept with couchdb.
I'm using merb and relaxdb at the moment, but I think I may need to
get a little lower level.
Question 1: Where do I store the original documents?
Right now I have a "RemoteUrl" document which contains the last-
modified, etag, encoding, and the content itself. (It's very
important in the design that I keep the original content and all
previous versions around.) For some reason, I can't store the
content directly -- I need to base64 encode it, which seems like a
problem with the ruby json library. But couchdb is slow when it has
all of these 200K documents sitting around in it. Is this not the
right sort of usage? I've created a map/reduce view for the "latest
revision" like this:
function(doc) {
  if( doc.class == "RemoteUrl" && doc.content ) {
    emit(doc.normalized_url, doc);
  }
}
function(key, values, rereduce) {
  // The values are documents on both the first reduce pass and on
  // rereduce, so the same "latest created_at wins" logic applies.
  var doc = values[0];
  var max = values[0].created_at;
  for( var i = 1; i < values.length; i++ ) {
    if( values[i].created_at > max ) {
      doc = values[i];
      max = values[i].created_at;
    }
  }
  return doc;
}
but it takes a long time to build and query. This is probably
because it actually needs to load up all the data, send it over a
pipe to the javascript process, have it do the map, send it back,
and repeat the process with the reduce step. It's pretty slow. Is
there a better way to do this? My guess is that if I made this into
two queries, and didn't emit doc.content (the actual content of the
file), it would be a lot faster, but that seems pretty ugly.
I.e. just map [doc._id, doc.created_at] and then make another trip
to pull the doc back by id. But you still have the problem of view
creation taking forever. Does anyone have any suggestions?
You should be able to get the latest revision without a reduce
(generally a good thing to avoid if you can). Something like
function(doc) {
  if( doc.class == "RemoteUrl" && doc.content ) {
    emit([doc.normalized_url, doc.created_at], doc);
  }
}
will give you all your documents sorted first by URL and then by
revision time. Then you can query the view with some combination of
startkey, count, and maybe descending=true (depending on how your
revision dates sort) to get the latest revision of a particular doc.
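For example (the database, design doc, and view names here are just
placeholders, and the exact view URL and key collation details can
vary between CouchDB versions), the newest revision of one URL could
come back as a single row with something like:

GET /spider/_view/scraper/latest_by_url?startkey=["http://example.com/menu",{}]&descending=true&count=1

The {} sentinel collates after any created_at string, so
descending=true with count=1 hands back just the most recent
revision for that URL; the startkey JSON has to be URL-encoded in a
real request.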
Alternatively, if you wanted to suppress all old revisions in the view
you could add a simpler reduce function which takes advantage of the
map sorting the results for you:
function(keys, values) {
  return values[0]; // or maybe values.pop();
}
By the way, you're using design documents and not _temp_views, right?
View indexing may currently be slow, but with a design doc you only
have to do it once.
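Roughly, a design doc is just another document whose "views" field
holds the function source as strings (the names here are
placeholders):

{
  "_id": "_design/scraper",
  "language": "javascript",
  "views": {
    "latest_by_url": {
      "map": "function(doc) { ... }",
      "reduce": "function(keys, values, rereduce) { ... }"
    }
  }
}

The index gets built the first time the view is queried and is then
updated incrementally for changed documents, instead of being
recomputed from scratch on every request the way a _temp_view is.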
You'll have to test and see whether it's better to emit the doc in the
view code or do a second trip to the DB to retrieve it. Both are
valid, and I think there's a patch in the works to add an
"include_docs" or similar parameter so that you can optionally
retrieve the associated document for any row of any view. Reduce
generally works best with small amounts of data.
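If you do try the two-trip route, the sketch is just your map with a
small value; each view row still carries the emitting doc's id, which
you can fetch afterwards with a plain GET /dbname/<id>:

function(doc) {
  if( doc.class == "RemoteUrl" && doc.content ) {
    // keep the index small: don't put doc.content in the key or value
    emit([doc.normalized_url, doc.created_at], null);
  }
}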
Question 2: I'm using RelaxDB right now, and it only really wants to
work with one database per environment. Seeing how slow couchdb is
at processing these documents, I was thinking that I'd want to keep
the bulk-data stuff in its own database so that the other views
won't need to process the whole data set. The views really work by
document "type", so there's no need to pump the huge amount of data
from the erlang process to the javascript process when all it's
doing is seeing that doc.class != "RemoteUrl". (Which is why I'd
want to have types, but no matter, I guess we can hack them on the
side like this!) I'm guessing that's why it's falling down. So if
I could split the "web-cache" database out from the "parsed"
database I think it would be a little faster. I'm wondering what
people think about this sort of design decision, and how they would
suggest implementing it.
If you don't need to analyze the bulk-data in any view you could
consider storing it as an attachment to a doc. Details are at the
bottom of this page:
http://wiki.apache.org/couchdb/HttpDocumentApi
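The inline form described there looks roughly like this (the doc and
attachment names are made up for illustration); the "data" field is
base64, and since attachment bodies aren't passed to map functions
the bulk content stays out of view processing:

{
  "_id": "remote_url_example",
  "class": "RemoteUrl",
  "normalized_url": "http://example.com/menu",
  "etag": "\"abc123\"",
  "_attachments": {
    "content.html": {
      "content_type": "text/html",
      "data": "PGh0bWw+IC4uLiA8L2h0bWw+"
    }
  }
}

You can then read the raw content back from
/dbname/remote_url_example/content.html without the base64 overhead.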
Question 3: Say I eventually get something which has 2 floating
point attributes (e.g. lat and long). How would I get all of the
documents where those fall inside a square? i.e.
select * from places where places.lat >= left_lat and places.lat <=
right_lat and places.longitude >= top_longitude and
places.longitude <= bottom_longitude;
I can see how you'd do this with one dimension, but I'm not sure
how you'd do it with the second. Especially since you need to make
all these data round trips...
Yeah, I guess that's a bit tricky. If the data volume doesn't get in
your way you could emit [places.lat, places.long] as the key, query
the view with your latitude range as startkey and endkey, and then
pick out documents in your longitude range client-side. Others may
well have more clever suggestions.
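A rough sketch of that map (the "Place" class and the lat/long field
names are just assumptions based on your example):

function(doc) {
  if( doc.class == "Place" && doc.lat != null && doc.long != null ) {
    emit([doc.lat, doc.long], null);
  }
}

You could query it with something like startkey=[left_lat] and
endkey=[right_lat,{}] to get the whole latitude band, then drop rows
whose second key element falls outside your longitude range on the
client.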
Best,
Adam
Thanks in advance. I think there are a lot of very interesting
ideas in Couchdb. It seems like a lot of this stuff can't be done
nearly as well as with a sql database, but I'm hoping that's just
me being ignorant.
-w
http://sublimeguile.com