The idea is just to put a layer on top of the abstract file-system functions that Directory provides. Whenever somebody wants to create a file and write data to it, your methods create more than one physical file and switch to the next one after, e.g., 10 megabytes. As an example, look at MMapDirectory, which uses mmap to map files into the address space. Because MappedByteBuffer only supports 32-bit offsets, it creates several mappings for the same file (the file is split into parts of 2 gigabytes). You could use similar code here and simply move on to another file whenever somebody seeks or writes beyond the 10 MB limit. Just "virtualize" the files.
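A rough sketch of the offset arithmetic, in plain Java and deliberately without any Lucene classes. Everything here is an assumption for illustration: the class name ChunkedFile, the ".partN" naming scheme, and the use of java.nio.file (Java 7+) are not part of Lucene. A real implementation would wrap this logic inside your own Directory, IndexInput and IndexOutput subclasses.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper: one logical file stored as several size-limited part files. */
final class ChunkedFile {
    private final Path dir;
    private final String name;
    private final long chunkSize;
    private long length = 0; // logical length; assumes a freshly created file

    ChunkedFile(Path dir, String name, long chunkSize) {
        this.dir = dir;
        this.name = name;
        this.chunkSize = chunkSize;
    }

    /** Physical file that holds part number {@code index} (naming is an assumption). */
    Path part(long index) {
        return dir.resolve(name + ".part" + index);
    }

    long length() {
        return length;
    }

    /** Appends data, switching to the next part whenever the current one is full. */
    void append(byte[] data) throws IOException {
        int pos = 0;
        while (pos < data.length) {
            long index = length / chunkSize;             // which part we are in
            long room = chunkSize - (length % chunkSize); // bytes left in that part
            int n = (int) Math.min(room, data.length - pos);
            try (OutputStream out = Files.newOutputStream(part(index),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
                out.write(data, pos, n);
            }
            pos += n;
            length += n;
        }
    }

    /** Reads one byte at a logical offset: a "seek" picks the part, then the position in it. */
    byte readByte(long offset) throws IOException {
        if (offset < 0 || offset >= length) {
            throw new EOFException("offset " + offset + " outside file of length " + length);
        }
        try (SeekableByteChannel ch = Files.newByteChannel(part(offset / chunkSize),
                StandardOpenOption.READ)) {
            ch.position(offset % chunkSize);
            ByteBuffer buf = ByteBuffer.allocate(1);
            if (ch.read(buf) != 1) {
                throw new EOFException("part file shorter than expected");
            }
            return buf.get(0);
        }
    }
}
```

The callers only ever see logical offsets; the division and modulo by the chunk size are the whole trick, which is why this works without knowing anything about the structure of the individual index file types. Reopening the part file on every call is only for brevity; real code would keep the current part open.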
-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> From: Dvora [mailto:barak.ya...@gmail.com]
> Sent: Thursday, September 10, 2009 1:23 PM
> To: java-user@lucene.apache.org
> Subject: Re: How to avoid huge index files
>
> Hi again,
>
> Can you add some details and guidelines on how to implement that? Different
> file types have different structures; is such splitting doable without
> knowing Lucene internals?
>
> Michael McCandless-2 wrote:
> >
> > You're welcome!
> >
> > Another, bottoms-up option would be to make a custom Directory impl
> > that simply splits up files above a certain size. That'd be more
> > generic and more reliable...
> >
> > Mike
> >
> > On Thu, Sep 10, 2009 at 5:26 AM, Dvora <barak.ya...@gmail.com> wrote:
> >>
> >> Hi,
> >>
> >> Thanks a lot for that, I will perform the experiments and publish the
> >> results.
> >> I'm aware of the risk of performance degradation, but for the pilot I'm
> >> trying to run I think it's acceptable.
> >>
> >> Thanks again!
> >>
> >> Michael McCandless-2 wrote:
> >>>
> >>> First, you need to limit the size of segments initially created by
> >>> IndexWriter due to newly added documents. Probably the simplest way
> >>> is to call IndexWriter.commit() frequently enough. You might want to
> >>> use IndexWriter.ramSizeInBytes() to gauge how much RAM is currently
> >>> consumed by IndexWriter's buffer to determine when to commit. But it
> >>> won't be an exact science, ie, the segment size will be different from
> >>> the RAM buffer size. So, experiment w/ it...
> >>>
> >>> Second, you need to prevent merging from creating a segment that's too
> >>> large. For this I would use the setMaxMergeMB method of the
> >>> LogByteSizeMergePolicy (which is IndexWriter's default merge policy).
> >>> But note that this max size applies to the *input* segments, so you'd
> >>> roughly want that to be 1.0 MB (your 10.0 MB divided by the merge
> >>> factor = 10), but probably make it smaller to be sure things stay
> >>> small enough.
> >>>
> >>> Note that with this approach, if your index is large enough, you'll
> >>> wind up with many segments and search performance will suffer when
> >>> compared to an index that doesn't have this max 10.0 MB file size
> >>> restriction.
> >>>
> >>> Mike
> >>>
> >>> On Thu, Sep 10, 2009 at 2:32 AM, Dvora <barak.ya...@gmail.com> wrote:
> >>>>
> >>>> Hello again,
> >>>>
> >>>> Can someone please comment on that, whether what I'm looking for is
> >>>> possible or not?
> >>>>
> >>>> Dvora wrote:
> >>>>>
> >>>>> Hello,
> >>>>>
> >>>>> I'm using Lucene2.4. I'm developing a web application that uses
> >>>>> Lucene (via compass) to do the searches.
> >>>>> I'm intending to deploy the application in Google App Engine
> >>>>> (http://code.google.com/appengine/), which limits file length to be
> >>>>> smaller than 10MB. I've read about the various policies supported by
> >>>>> Lucene to limit the file sizes, but no matter which policy I used and
> >>>>> which parameters, the index files still grew to be a lot more than 10MB.
> >>>>> Looking at the code, I've managed to limit the cfs files (predicting
> >>>>> the file size in CompoundFileWriter before closing the file) - I guess
> >>>>> that will degrade performance, but it's OK for now. But now the FDT
> >>>>> files are becoming huge (about 60MB) and I can't identify a way to
> >>>>> limit those files.
> >>>>>
> >>>>> Is there some built-in and correct way to limit the length of these
> >>>>> files? If not, can someone please direct me how I should tweak the
> >>>>> source code to achieve that?
> >>>>>
> >>>>> Thanks for any help.
> >>>>>
> >>>>
> >>>> --
> >>>> View this message in context:
> >>>> http://www.nabble.com/How-to-avoid-huge-index-files-tp25347505p25378056.html
> >>>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> >>>>
> >>>
> >>
> >> --
> >> View this message in context:
> >> http://www.nabble.com/How-to-avoid-huge-index-files-tp25347505p25380052.html
> >> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> >>
> >
>
> --
> View this message in context:
> http://www.nabble.com/How-to-avoid-huge-index-files-tp25347505p25381489.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org