On Feb 2, 1:50 pm, John Machin <[EMAIL PROTECTED]> wrote:
> agc wrote:
> > Hi,
>
> > I'm looking for a fast way of accessing some simple (structured) data.
>
> > The data is like this:
> > Approx 6 - 10 GB simple XML files with the only elements
> > I really care about are the <title> and <article> ones.
>
> > So what I'm hoping to do is put this data in a format so
> > that I can access it as fast as possible for a given request
> > (http request, Python web server) that specifies just the title,
> > and I return the article content.
>
> > Is there some good format that is optimized for search for
> > just 1 attribute (title) and then returning the corresponding article?
>
> > I've thought about putting this data in a SQLite database because
> > from what I know SQLite has very fast reads (no network latency, etc)
> > but not as fast writes, which is fine because I probably wont be doing
> > much writing (I wont ever care about the speed of any writes).
>
> > So is a database the way to go, or is there some other,
> > more specialized format that would be better?
>
> "Database" without any further qualification indicates exact matching,
> which doesn't seem to be very practical in the context of titles of
> articles. There is an enormous body of literature on inexact/fuzzy
> matching, and lots of deployed applications -- it's not a Python-related
> question, really.

Yes, you are right that in some sense this question is not truly
Python-related, but I am looking to solve this problem in a way that
plays as nicely as possible with Python.

I guess an important feature of what I'm looking for is some kind of
mapping from *exact* title to corresponding article, i.e. if my data
set weren't so large, I would just keep all my data in an in-memory
Python dictionary, which would be very fast.

But I have about 2 million article titles mapping to approx. 6-10 GB
of article bodies, so I think this would be just too big for a simple
Python dictionary.

Does anyone have any advice on the feasibility of using just an
in-memory dictionary? The dataset just seems too big, but maybe there
is a related method?

Thanks,
Alex
-- 
http://mail.python.org/mailman/listinfo/python-list
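
P.S. To make the exact-title mapping concrete, here is a minimal sketch of the SQLite variant, using only the standard-library sqlite3 module. The table name (articles), the column names (title, body), and the helper functions build_store and lookup are all made up for illustration, and the two sample rows are placeholders rather than real data. With title as the primary key, SQLite keeps an index on it, so a lookup should behave much like a dictionary access that lives on disk instead of in RAM.

```python
import sqlite3


def build_store(path, articles):
    """Create (or reuse) a title -> body table and load (title, body) pairs.

    `path` may be a file name, or ":memory:" for a throwaway test database.
    """
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "title TEXT PRIMARY KEY, body TEXT NOT NULL)"
    )
    # The connection as a context manager commits on success.
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO articles (title, body) VALUES (?, ?)",
            articles,
        )
    return conn


def lookup(conn, title):
    """Return the article body for an exact title, or None if absent."""
    row = conn.execute(
        "SELECT body FROM articles WHERE title = ?", (title,)
    ).fetchone()
    return row[0] if row is not None else None


# Placeholder data; the real input would be streamed out of the XML files.
sample = [
    ("Python", "Python is a programming language."),
    ("SQLite", "SQLite is an embedded database."),
]
conn = build_store(":memory:", sample)
print(lookup(conn, "SQLite"))
print(lookup(conn, "No such title"))
```

Since `articles` can be any iterable of (title, body) pairs, the XML could be parsed incrementally and fed in through a generator, so the full 6-10 GB would never need to sit in memory at once.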