Yes, that's what I ended up doing. I will probably "fork" a new Java VM instead of doing it in the same one, so that I can control the memory requirements. It hasn't given me any problems so far (it even worked with -Xmx set, though it probably wouldn't if I did something else in the program at the same time). I'm not indexing the book subjects yet either; I'll need to do some sort of string caching for those and for the authors.
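Roughly what I have in mind for the fork (a minimal sketch using plain ProcessBuilder; CatalogIndexer and catalog.rdf are made-up names for the indexing entry point and the catalog file, and -Xmx128m is just an illustrative cap):

import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStreamReader;

public class IndexerLauncher {
    public static void main(String[] args) throws IOException, InterruptedException {
        String javaBin = System.getProperty("java.home") + File.separator + "bin"
                + File.separator + "java";
        ProcessBuilder pb = new ProcessBuilder(
                javaBin,
                "-Xmx128m",                       // cap the child VM's heap independently of the parent
                "-cp", System.getProperty("java.class.path"),
                "CatalogIndexer",                 // hypothetical entry point for the indexing pass
                "catalog.rdf");                   // illustrative path to the catalog file
        pb.redirectErrorStream(true);             // merge the child's stderr into its stdout
        Process child = pb.start();
        BufferedReader out = new BufferedReader(new InputStreamReader(child.getInputStream()));
        for (String line; (line = out.readLine()) != null; ) {
            System.out.println(line);             // relay the child's output
        }
        System.out.println("indexer exited with code " + child.waitFor());
    }
}

The nice part is that the child VM gets its own heap limit, so the indexing pass can't starve whatever else the main program is doing.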
On Sun, Oct 31, 2010 at 8:47 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> Hmmmm. Are you too memory limited to do a first pass through the file and
> save the key/download-link part in a map, then make another pass through
> the file indexing the data and grabbing the link from your map? I'm
> assuming that there's a lot less than 200 MB in just the key/link part.
>
> Alternatively (and this would probably be kind of slow, but...) still do a
> two-pass process, but instead of making a map, put the data in a Lucene
> index on disk. Then the second pass searches that index for the data to
> add to the docs in your "real" index.
>
> HTH
> Erick
>
> On Sun, Oct 31, 2010 at 12:17 PM, Paulo Levi <i30...@gmail.com> wrote:
>
> > I'm stepping through an RDF file (the Project Gutenberg catalog) and
> > sending data to a Lucene index to allow searches of titles, authors, and
> > such. However, the Gutenberg RDF is a little bit "special". It has two
> > sections: one for title, authors, collaborators, and such, and (after
> > all the books) another section that has the download links. The
> > connection is a kind of foreign key that exists in both tags (a unique
> > number id). While I don't need to search the download link, I do need
> > to save it.
> >
> > I'm memory limited and can't load the 200 MB catalog file into memory.
> > I'm wondering if there is some way for me to use the number id to
> > connect both kinds of information without having to keep things in
> > memory. A first search for the things I want, and a second using the
> > number id? It seems very clumsy. I'm not actually using a database, and
> > I don't want to use very large libraries. Compass is 60 MB (!). I tried
> > LuceneSail for a while, but it has stopped working and the code is a
> > mess (it is not adapted to the filtering of the Gutenberg RDF that I'm
> > doing).
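For the record, this is roughly how I read the first suggestion (a minimal sketch against the Lucene 3.x-era API; BookEntry, parseLinks, and parseEntries are hypothetical placeholders for the streaming SAX/StAX handling of the RDF, not real library calls):

import java.io.File;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class TwoPassIndexer {

    public static void index(File catalog, File indexDir) throws Exception {
        // Pass 1: stream only the download-link section and keep id -> link.
        // The map holds one short string per book, far less than the 200 MB file.
        Map<String, String> links = new HashMap<String, String>();
        for (String[] idAndLink : parseLinks(catalog)) {    // hypothetical streaming parse
            links.put(idAndLink[0], idAndLink[1]);
        }

        // Pass 2: stream the title/author section and index it, attaching the link.
        IndexWriter writer = new IndexWriter(
                FSDirectory.open(indexDir),
                new StandardAnalyzer(Version.LUCENE_30),
                IndexWriter.MaxFieldLength.UNLIMITED);
        try {
            for (BookEntry entry : parseEntries(catalog)) { // hypothetical streaming parse
                Document doc = new Document();
                doc.add(new Field("title", entry.title, Field.Store.YES, Field.Index.ANALYZED));
                doc.add(new Field("author", entry.author, Field.Store.YES, Field.Index.ANALYZED));
                String link = links.get(entry.id);
                if (link != null) {
                    // Stored but not indexed: the link is never searched, only retrieved.
                    doc.add(new Field("link", link, Field.Store.YES, Field.Index.NO));
                }
                writer.addDocument(doc);
            }
        } finally {
            writer.close();
        }
    }

    // Hypothetical value type and parsers; a SAX or StAX handler over the RDF
    // would produce these without ever loading the whole file.
    static class BookEntry { String id, title, author; }
    static Iterable<String[]> parseLinks(File f) { throw new UnsupportedOperationException(); }
    static Iterable<BookEntry> parseEntries(File f) { throw new UnsupportedOperationException(); }
}

If the map turned out to be too big after all, pass 1 would instead write id/link docs into a scratch Lucene index on disk and pass 2 would look each id up there, as in the second suggestion.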