But it's decompressed when it's read into memory, right? So the size will be the same in memory whether it was compressed in the filesystem or not? Or am I missing something, Billy?
St.Ack
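(For illustration, a minimal standalone sketch of that point, using plain java.util.zip rather than the Hadoop codecs: however small the compressed form is on disk, the data occupies its full, decompressed size once it has been read onto the heap.)

import java.util.Arrays;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Illustrative only, not Hadoop code: on-heap footprint of the data is
// independent of how well it compressed on disk.
public class CompressionFootprint {
  public static void main(String[] args) throws Exception {
    byte[] original = new byte[1 << 20];   // 1 MB of zeros: compresses very well

    Deflater deflater = new Deflater();
    deflater.setInput(original);
    deflater.finish();
    byte[] buf = new byte[1 << 20];
    int compressedLen = deflater.deflate(buf);
    byte[] compressed = Arrays.copyOf(buf, compressedLen);

    Inflater inflater = new Inflater();
    inflater.setInput(compressed);
    byte[] decompressed = new byte[original.length];
    int decompressedLen = inflater.inflate(decompressed);

    System.out.println("compressed on 'disk':   " + compressedLen + " bytes");
    System.out.println("decompressed in memory: " + decompressedLen + " bytes");
  }
}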
On Thu, Nov 6, 2008 at 7:55 AM, Billy Pearson <[EMAIL PROTECTED]> wrote:

> There is no method to change the compression of the index; it is just always
> block compressed.
> I hacked the code and changed it to non-compressed so I could get the size
> of the index without compression.
> Opening all 80 mapfiles took 4x the memory of the uncompressed size of all
> the index files.
>
> "stack" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED]
>
>> On Wed, Nov 5, 2008 at 11:52 PM, Billy Pearson <[EMAIL PROTECTED]> wrote:
>>
>>> I ran a job on 80 mapfiles to write 80 new files with non-compressed
>>> indexes, and it still took ~4X the memory of the sizes of the uncompressed
>>> index files to load them into memory.
>>
>> Sorry Billy, how did you specify non-compressed indices? What took 4X
>> memory? The non-compressed index?
>>
>>> It could have to do with the way they grow the arrays storing the positions
>>> of the keys, starting on line 333.
>>> Looks like they are copying arrays and making a new one 150% bigger than
>>> the last as needed.
>>> Not sure about Java and how long before the old array will be recovered
>>> from memory.
>>> I have seen it a few times recover to about ~2x the size of the
>>> uncompressed index files, but only twice.
>>
>> Unreferenced Java objects will be let go variously. Depends on your JVM
>> configuration. Usually they'll be let go when the JVM needs the memory
>> (links like this may be of help:
>> http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html#par_gc.oom
>> )
>>
>>> I am testing by creating the files with a MR job and then loading the map
>>> files in a simple program that opens the files and finds the midkey, so the
>>> index gets read into memory, while watching the top command.
>>> I also added -Xloggc:/tmp/gc.log and watched the memory usage go up; it
>>> matches top for the most part.
>>>
>>> I tried running System.gc() to force a cleanup of the memory but it did not
>>> seem to help any.
>>
>> Yeah, it's just a suggestion. The gc.log should give you a better clue of
>> what's going on. What's it saying? Lots of small GCs and then a full GC
>> every so often? Is the heap discernibly growing? You could enable JMX
>> for the JVM and connect with jconsole. This can give you a more detailed
>> picture of the heap.
>>
>> St.Ack
>> P.S. Check out HBASE-722 if you have a sec.
>>
>>> Billy
>>>
>>> "Billy Pearson" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED]
>>>
>>>> I have been looking over the MapFile class in Hadoop for memory problems
>>>> and think I might have found an index bug.
>>>>
>>>> org.apache.hadoop.io.MapFile
>>>> line 202
>>>> if (size % indexInterval == 0) { // add an index entry
>>>>
>>>> This is where it is writing the index, adding an entry only every
>>>> indexInterval rows.
>>>>
>>>> Then on the loading of the index,
>>>> line 335
>>>>
>>>> if (skip > 0) {
>>>>   skip--;
>>>>   continue; // skip this entry
>>>>
>>>> we are only reading in every skip-th entry.
>>>>
>>>> So with the default of 32, I think in HBase we are only writing an index
>>>> entry to the index file every 32 rows, and then only reading back every
>>>> 32nd of those, so we only get an index entry every 1024 rows.
>>>>
>>>> Take a look and confirm, and we can open a bug on Hadoop about it.
>>>>
>>>> Billy
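For reference, a minimal sketch of the interaction Billy describes above (illustrative code, not the Hadoop source; writeInterval and readSkip are stand-in names for whatever the writer interval and reader-side skip are configured to): the writer records an index entry only every writeInterval-th record, and the reader then keeps only one of every (readSkip + 1) entries it reads back, so the in-memory index ends up with roughly one entry per writeInterval * (readSkip + 1) records, on the order of the one-per-1024 figure Billy arrives at when both values are 32.

import java.util.ArrayList;
import java.util.List;

public class EffectiveIndexInterval {
  public static void main(String[] args) {
    int totalRecords = 100_000;
    int writeInterval = 32;   // one index entry per 32 records written
    int readSkip = 32;        // entries dropped between kept entries on load

    // Writer side, in the spirit of "if (size % indexInterval == 0)":
    List<Integer> onDiskIndex = new ArrayList<>();
    for (int record = 0; record < totalRecords; record++) {
      if (record % writeInterval == 0) {
        onDiskIndex.add(record);          // record number stands in for (key, position)
      }
    }

    // Reader side, in the spirit of "if (skip > 0) { skip--; continue; }":
    List<Integer> inMemoryIndex = new ArrayList<>();
    int skip = readSkip;
    for (int entry : onDiskIndex) {
      if (skip > 0) {
        skip--;
        continue;                         // drop this entry
      }
      skip = readSkip;                    // keep this one, then start skipping again
      inMemoryIndex.add(entry);
    }

    System.out.println("index entries on disk:   " + onDiskIndex.size());
    System.out.println("index entries in memory: " + inMemoryIndex.size());
    System.out.println("effective spacing:       one entry per "
        + (writeInterval * (readSkip + 1)) + " records");
  }
}

If the reader really drops entries like this, a sparser in-memory index trades heap per open mapfile against more records scanned per lookup, which is why the two knobs matter for the memory question in this thread.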
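A similar hedged sketch of the growth pattern Billy points at around line 333: positions accumulate in an array that is replaced by a copy 150% of the old size whenever it fills. During each copy both the old and the new array are live, and the discarded one only goes away when the collector gets to it, which is one way transient heap use can exceed the final size of the index. (Illustrative code, not the MapFile source.)

import java.util.Arrays;

public class GrowByHalf {
  public static void main(String[] args) {
    long[] positions = new long[1024];
    int count = 0;

    for (long pos = 0; pos < 1_000_000; pos++) {
      if (count == positions.length) {
        // Grow to 150% of the previous capacity; the old array is garbage
        // until the collector reclaims it.
        int newLength = positions.length + positions.length / 2;
        System.out.println("growing " + positions.length + " -> " + newLength);
        positions = Arrays.copyOf(positions, newLength);
      }
      positions[count++] = pos;
    }
    System.out.println("final capacity: " + positions.length + " for " + count + " entries");
  }
}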
