But it's decompressed when it's read into memory, right? So the size will be the same in memory whether it was compressed in the filesystem or not? Or am I missing something, Billy?
St.Ack
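(For illustration, a minimal standalone sketch of that point, using plain java.util.zip rather than the Hadoop codecs: however small the compressed form is on disk, the data occupies its full, decompressed size once it has been read onto the heap.)

import java.util.Arrays;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Illustrative only, not Hadoop code: on-heap footprint of the data is
// independent of how well it compressed on disk.
public class CompressionFootprint {
  public static void main(String[] args) throws Exception {
    byte[] original = new byte[1 << 20];   // 1 MB of zeros: compresses very well

    Deflater deflater = new Deflater();
    deflater.setInput(original);
    deflater.finish();
    byte[] buf = new byte[1 << 20];
    int compressedLen = deflater.deflate(buf);
    byte[] compressed = Arrays.copyOf(buf, compressedLen);

    Inflater inflater = new Inflater();
    inflater.setInput(compressed);
    byte[] decompressed = new byte[original.length];
    int decompressedLen = inflater.inflate(decompressed);

    System.out.println("compressed on 'disk':   " + compressedLen + " bytes");
    System.out.println("decompressed in memory: " + decompressedLen + " bytes");
  }
}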
On Thu, Nov 6, 2008 at 7:55 AM, Billy Pearson <[EMAIL PROTECTED]> wrote:

> There is no method to change the compression of the index; it is just always
> block compressed.
> I hacked the code and changed it to non-compressed so I could get the size
> of the index without compression.
> Opening all 80 mapfiles took 4x the memory of the uncompressed size of all
> the index files.
>
> "stack" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED]
>
>> On Wed, Nov 5, 2008 at 11:52 PM, Billy Pearson <[EMAIL PROTECTED]> wrote:
>>
>>> I ran a job on 80 mapfiles to write 80 new files with non-compressed
>>> indexes, and it still took ~4X the memory of the sizes of the uncompressed
>>> index files to load them into memory.
>>
>> Sorry Billy, how did you specify non-compressed indices? What took 4X
>> memory? The non-compressed index?
>>
>>> It could have to do with the way they grow the arrays storing the positions
>>> of the keys, starting on line 333.
>>> Looks like they are copying arrays and making a new one 150% bigger than
>>> the last as needed.
>>> Not sure about Java and how long before the old array will be recovered
>>> from memory.
>>> I have seen it a few times recover to about ~2x the size of the
>>> uncompressed index files, but only twice.
>>
>> Unreferenced Java objects will be let go variously. Depends on your JVM
>> configuration. Usually they'll be let go when the JVM needs the memory
>> (links like this may be of help:
>> http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html#par_gc.oom
>> )
>>
>>> I am testing by creating the files with a MR job and then loading the map
>>> files in a simple program that opens the files and finds the midkey, so the
>>> index gets read into memory, while watching the top command.
>>> I also added -Xloggc:/tmp/gc.log and watched the memory usage go up; it
>>> matches top for the most part.
>>>
>>> I tried running System.gc() to force a cleanup of the memory but it did not
>>> seem to help any.
>>
>> Yeah, it's just a suggestion. The gc.log should give you a better clue of
>> what's going on. What's it saying? Lots of small GCs and then a full GC
>> every so often? Is the heap discernibly growing? You could enable JMX
>> for the JVM and connect with jconsole. This can give you a more detailed
>> picture of the heap.
>>
>> St.Ack
>> P.S. Check out HBASE-722 if you have a sec.
>>
>>> Billy
>>>
>>> "Billy Pearson" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED]
>>>
>>>> I have been looking over the MapFile class in Hadoop for memory problems
>>>> and think I might have found an index bug.
>>>>
>>>> org.apache.hadoop.io.MapFile
>>>> line 202
>>>> if (size % indexInterval == 0) { // add an index entry
>>>>
>>>> This is where it is writing the index, adding an entry only every
>>>> indexInterval rows.
>>>>
>>>> Then on the loading of the index,
>>>> line 335
>>>>
>>>> if (skip > 0) {
>>>>   skip--;
>>>>   continue; // skip this entry
>>>>
>>>> we are only reading in every skip-th entry.
>>>>
>>>> So with the default of 32, I think in HBase we are only writing an index
>>>> entry to the index file every 32 rows, and then only reading back every
>>>> 32nd of those, so we only get an index entry every 1024 rows.
>>>>
>>>> Take a look and confirm, and we can open a bug on Hadoop about it.
>>>>
>>>> Billy
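For reference, a minimal sketch of the interaction Billy describes above (illustrative code, not the Hadoop source; writeInterval and readSkip are stand-in names for whatever the writer interval and reader-side skip are configured to): the writer records an index entry only every writeInterval-th record, and the reader then keeps only one of every (readSkip + 1) entries it reads back, so the in-memory index ends up with roughly one entry per writeInterval * (readSkip + 1) records, on the order of the one-per-1024 figure Billy arrives at when both values are 32.

import java.util.ArrayList;
import java.util.List;

public class EffectiveIndexInterval {
  public static void main(String[] args) {
    int totalRecords = 100_000;
    int writeInterval = 32;   // one index entry per 32 records written
    int readSkip = 32;        // entries dropped between kept entries on load

    // Writer side, in the spirit of "if (size % indexInterval == 0)":
    List<Integer> onDiskIndex = new ArrayList<>();
    for (int record = 0; record < totalRecords; record++) {
      if (record % writeInterval == 0) {
        onDiskIndex.add(record);          // record number stands in for (key, position)
      }
    }

    // Reader side, in the spirit of "if (skip > 0) { skip--; continue; }":
    List<Integer> inMemoryIndex = new ArrayList<>();
    int skip = readSkip;
    for (int entry : onDiskIndex) {
      if (skip > 0) {
        skip--;
        continue;                         // drop this entry
      }
      skip = readSkip;                    // keep this one, then start skipping again
      inMemoryIndex.add(entry);
    }

    System.out.println("index entries on disk:   " + onDiskIndex.size());
    System.out.println("index entries in memory: " + inMemoryIndex.size());
    System.out.println("effective spacing:       one entry per "
        + (writeInterval * (readSkip + 1)) + " records");
  }
}

If the reader really drops entries like this, a sparser in-memory index trades heap per open mapfile against more records scanned per lookup, which is why the two knobs matter for the memory question in this thread.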
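A similar hedged sketch of the growth pattern Billy points at around line 333: positions accumulate in an array that is replaced by a copy 150% of the old size whenever it fills. During each copy both the old and the new array are live, and the discarded one only goes away when the collector gets to it, which is one way transient heap use can exceed the final size of the index. (Illustrative code, not the MapFile source.)

import java.util.Arrays;

public class GrowByHalf {
  public static void main(String[] args) {
    long[] positions = new long[1024];
    int count = 0;

    for (long pos = 0; pos < 1_000_000; pos++) {
      if (count == positions.length) {
        // Grow to 150% of the previous capacity; the old array is garbage
        // until the collector reclaims it.
        int newLength = positions.length + positions.length / 2;
        System.out.println("growing " + positions.length + " -> " + newLength);
        positions = Arrays.copyOf(positions, newLength);
      }
      positions[count++] = pos;
    }
    System.out.println("final capacity: " + positions.length + " for " + count + " entries");
  }
}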
