Hi Barnabas,

If I may suggest a way to approach this sort of question: I'd take a heap dump of an impalad and a catalogd (using "jmap") and then use Eclipse MAT or http://www.jxray.com/ to see whether we're using Path. You'll want to load some tables and partitions ahead of time. Based on a little quick sleuthing (I'm not well-versed in this area of the code), fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java seems to use FlatBuffers to store these. It's likely we don't use the HDFS Path object in steady state.
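For a quick first pass before opening a full heap dump, the class histogram printed by "jmap -histo <pid>" can be filtered for a single class. Below is a minimal sketch of that idea, assuming the usual "num: #instances #bytes class name" row layout; the class and method names (HistoFilter, sumForClass) are made up for illustration and are not part of Impala or Hadoop.

```java
import java.util.Arrays;
import java.util.List;

class HistoFilter {
    /**
     * Sums instance and byte counts over histogram rows whose class name
     * equals className. Returns {instances, bytes}.
     * Assumes rows shaped like "893:  4  64  org.apache.hadoop.fs.Path".
     */
    public static long[] sumForClass(List<String> histoLines, String className) {
        long instances = 0;
        long bytes = 0;
        for (String line : histoLines) {
            String[] fields = line.trim().split("\\s+");
            // Skip the header, the separator, and anything that is not a data row.
            if (fields.length < 4 || !fields[0].endsWith(":")) {
                continue;
            }
            if (!fields[3].equals(className)) {
                continue;
            }
            try {
                instances += Long.parseLong(fields[1]);
                bytes += Long.parseLong(fields[2]);
            } catch (NumberFormatException e) {
                // Row did not carry numeric counts; ignore it.
            }
        }
        return new long[] {instances, bytes};
    }

    public static void main(String[] args) {
        // A few rows in the jmap -histo layout, used here as sample input.
        List<String> sample = Arrays.asList(
            " num     #instances         #bytes  class name",
            "----------------------------------------------",
            "   1:        416952       81879632  [B",
            " 893:             4             64  org.apache.hadoop.fs.Path",
            " 927:             2             64  sun.nio.fs.UnixPath");
        long[] result = sumForClass(sample, "org.apache.hadoop.fs.Path");
        System.out.println(result[0] + " instances, " + result[1] + " bytes");
    }
}
```

Matching on the exact class name avoids the unrelated hits (URLClassPath, UnixPath and so on) that a plain substring search for "Path" pulls in. A histogram only gives shallow sizes, so a heap dump in MAT is still the better tool for seeing what the Path objects retain.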

I took a quick look at a cluster we have lying around and found negligible use at the time, though I'm not entirely sure what's loaded on that cluster right now.

[root@... philip]# sudo -u impala /usr/java/jdk1.8.0_111/bin/jmap -histo 47116 > /tmp/histo
[root@.... philip]# cat /tmp/histo | grep Path
  85:           247          13832  sun.misc.URLClassPath$JarLoader
 567:             3            144  sun.misc.URLClassPath
 568:             6            144  sun.misc.URLClassPath$FileLoader
 724:             4             96  sun.security.provider.certpath.X509CertPath
 762:             2             80  sun.misc.URLClassPath$1
* 893:             4             64  org.apache.hadoop.fs.Path*
 927:             2             64  sun.nio.fs.UnixPath
1030:             2             48  java.io.File$PathStatus
1211:             1             40  org.apache.hadoop.hdfs.protocol.proto.EncryptionZonesProtos$GetEZForPathRequestProto
1226:             1             40  sun.misc.URLClassPath$2
1411:             1             24  [Ljava.io.File$PathStatus;
1470:             1             24  com.sun.org.apache.bcel.internal.util.ClassPath
1645:             1             16  [Lcom.sun.org.apache.bcel.internal.util.ClassPath$PathEntry;
2035:             1             16  org.apache.hadoop.hdfs.protocol.proto.EncryptionZonesProtos$GetEZForPathRequestProto$1
[root@... philip]# head /tmp/histo

 num     #instances         #bytes  class name
----------------------------------------------
   1:        416952       81879632  [B
   2:       1324260       42376320  com.codahale.metrics.LongAdder
   3:        794556       38138688  com.codahale.metrics.EWMA
   4:       1060844       25460256  java.util.concurrent.atomic.AtomicLong
   5:        264852       14831712  com.codahale.metrics.ExponentiallyDecayingReservoir
   6:        264852       12712896  com.codahale.metrics.Meter
   7:        264852       12712896  java.util.concurrent.ConcurrentSkipListMap

-- Philip

On Wed, Aug 1, 2018 at 8:19 AM Barnabás Maidics <barnabas.maid...@cloudera.com.invalid> wrote:

> On Wed, Aug 1, 2018 at 11:17 AM Barnabás Maidics <barnabas.maid...@cloudera.com> wrote:
>
> > Hi Everyone!
> >
> > I'm an intern at Cloudera and analysing where the memory goes in Hive. I
> > was looking at a heap dump with many partitions, and found some memory
> > waste that comes from HDFS.
> >
> > We store paths in hadoop.fs.Path objects.
> > This uses java.net.URI, which stores almost the same strings in 3
> > different objects (see image and further explanation at the link given
> > below). I think it's a waste of memory and it could be reduced by
> > replacing the URI objects. This is why I've created an issue on the HDFS
> > side (HDFS-13752 <https://issues.apache.org/jira/browse/HDFS-13752>).
> >
> > I'm curious whether you store these objects (hadoop.fs.Path), and if you
> > do, how much it affects the overall memory usage of Impala. It may be
> > beneficial for you as well if it can be replaced.
> >
> > Thanks,
> >
> > Barnabas Maidics