Hi Renato,

I will follow Alfonso's recommendations about reusing objects as much as I can. I will push those changes to the branch by the end of this week.
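As a rough illustration of the reuse recommended in this thread (one record instance allocated once, field names computed once, fields set by index), here is a minimal Java sketch. The `User` class below is a hypothetical stand-in for the Gora-generated bean so that the example runs without Gora or Avro on the classpath; `ReuseSketch`, `FIELD_NAMES` and `fill` are illustrative names, not part of the benchmark module. It also assumes the backend copies or serializes the record during put, which is what makes reusing the instance safe.

```java
import java.util.HashMap;
import java.util.Map;

public class ReuseSketch {

    /** Hypothetical stand-in for the Gora-generated User bean; fields are addressed by index. */
    public static final class User {
        private final String[] fields = new String[10];

        public void put(int index, String value) {
            fields[index] = value;
        }

        public String get(int index) {
            return fields[index];
        }
    }

    /** Field names built once, instead of concatenating "field" + i on every insert. */
    public static final String[] FIELD_NAMES = new String[10];

    static {
        for (int i = 0; i < FIELD_NAMES.length; i++) {
            FIELD_NAMES[i] = "field" + i;
        }
    }

    /** Single instance allocated up front (for example in init()) and reused by every insert. */
    private final User user = new User();

    /**
     * Overwrites every field of the reused instance, so no value from the
     * previous record survives. Returns the same instance on every call.
     */
    public User fill(Map<String, String> values) {
        for (int i = 0; i < FIELD_NAMES.length; i++) {
            user.put(i, values.get(FIELD_NAMES[i]));
        }
        return user;
    }

    public static void main(String[] args) {
        ReuseSketch sketch = new ReuseSketch();
        Map<String, String> values = new HashMap<>();
        for (String name : FIELD_NAMES) {
            values.put(name, "value-of-" + name);
        }
        User record = sketch.fill(values);
        System.out.println(record.get(0));
    }
}
```

Because every field is overwritten on each call, a record that lacks a field ends up with null in that slot rather than a stale value from the previous insert.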
To answer your questions:

Yes, you are right, I am using a clean, cold JVM. If necessary, I can also look at warming up the JVM down the line.

Yes, I have tried setting *gora.hbasestore.scanner.caching* to different values, but there was no significant difference. Also, I may be wrong, but I think this setting affects scan operations and not insert operations?

As for flushing, I tried it, but it quickly throws an error, so I commented out that line of code. I think this is because the insert operation inserts a single user object per call, so calling dataStore.flush() within that method would mean calling flush on every object insertion. Is that not the case? There should be a way to track the progress of the inserts, which could then be used to call flush after every N insert calls. So I used *gora.hbasestore.hbase.client.autoflush.enabled=true*, which would automatically call flush at some point. However, as I mentioned in my previous email, enabling autoflush decreases write performance [1].

[1] https://gora.apache.org/current/gora-hbase.html

Thank you.

**Sheriffo Ceesay**

On Tue, Jun 11, 2019 at 10:52 PM Renato Marroquín Mogrovejo <renatoj.marroq...@gmail.com> wrote:

> Hey Sheriffo,
>
> Cool to hear you are making progress! :) And great to see that we have some numbers already! :)
> Regarding optimization point (1), regardless of whether this was the cause of the issue or not, Alfonso's suggestions are something we should follow; many short-lived objects in Java might create a performance problem sooner or later. Also about your comment:
>
> "Also, I may be wrong but the way I understand YCSB framework is, it will execute an insert operation for each user object, so I thought it was right to create a user object within the insert method."
>
> As you pointed out, YCSB is about inserting the objects, and NOT about creating them, so it doesn't matter if we reuse the objects, as long as the values that we insert are actually correct.
> We don't want to end up measuring object creation + GC. I think Alfonso's comment was hinting in that direction (please feel free to correct me @Alfonso if I am misunderstanding you) and I think his comments are spot on.
> I have some other questions regarding the numbers you sent around:
> - Are you running YCSB for each data store with a warm JVM? Or is each of these numbers from a clean, cold JVM? I suppose the latter, right?
> - Did you try setting gora.hbasestore.scanner.caching to a lower value?
> - Which command are you using to run/start this code?
> - Did you try flushing the commits more regularly in
>   https://github.com/sneceesay77/gora/blob/GORA-532/gora-benchmark/src/main/java/org/apache/gora/benchmark/GoraBenchmarkClient.java#L142
>   let's say every 1000 elements, or something like that? I mean instead of at the end of the 1M elements?
>
> Thanks a lot for the report Sheriffo!
>
> Best,
>
> Renato M.
>
> On Tue, Jun 11, 2019 at 16:12, Sheriffo Ceesay (<sneceesa...@gmail.com>) wrote:
> >
> > Hello,
> >
> > I have taken a proper look at the recommendations from @Alfonso and @Renato and below are the outcomes.
> >
> > Failed Attempts
> > 1. Optimisation: for the insert operation, to avoid the concatenation issue, I have just taken the quickest route by calling the methods directly without reflection. Below are those calls. Note: I have moved all reusable code to the init method.
> >
> >> public int insert(String table, String key, HashMap<String, ByteIterator> values) {
> >>   try {
> >>     // 'user' is created once in init() and reused for every insert
> >>     user.setField0(values.get("field0").toString());
> >>     user.setField1(values.get("field1").toString());
> >>     user.setField2(values.get("field2").toString());
> >>     user.setField3(values.get("field3").toString());
> >>     user.setField4(values.get("field4").toString());
> >>     user.setField5(values.get("field5").toString());
> >>     user.setField6(values.get("field6").toString());
> >>     user.setField7(values.get("field7").toString());
> >>     user.setField8(values.get("field8").toString());
> >>     user.setField9(values.get("field9").toString());
> >>     dataStore.put(user.getUserId().toString(), user);
> >>   } catch (Exception e) {
> >>     return FAILED;
> >>   }
> >>   return SUCCESS;
> >> }
> >
> > If the above had worked, I would have changed the code as suggested by Alfonso. Also, I may be wrong but the way I understand YCSB framework is, it will execute an insert operation for each user object, so I thought it was right to create a user object within the insert method.
> >
> > 2. I used different config values for -Xmx (256MB, 512MB, 1GB, 2GB) and even disabled GC checking using -XX:-UseGCOverheadLimit but they all failed with the same GC error.
> >
> > Successful Attempt -- There may be room for improvement
> > Using the configurations below worked but I think it is not the best for write performance.
> >
> > First, I read from [1] related to [2] that the following one-liner should be executed for better HBase performance when using YCSB. It basically avoids overloading a single region server.
> >
> > hbase(main):001:0> n_splits = 200 # HBase recommends (10 * number of regionservers)
> > hbase(main):002:0> create 'users', 'info', {SPLITS => (1..n_splits).map {|i| "user#{1000+i*(9999-1000)/n_splits}"}}
> >
> > Second, as suggested by @Renato Marroquín Mogrovejo, it only works when I set
> >
> > hbase.client.autoflush.default=true
> >
> > However, from [3], I found "HBase autoflushing. Enabling autoflush decreases write performance. Available since Gora 0.2. Defaults to disabled.". So I am of the opinion that the problem is not entirely solved.
> >
> > I have done the following testing to insert 1M records into MongoDB and HBase, so I think this may not be bad after all, but more benchmarks may be required to validate this. HBase through Gora has almost the same performance as HBase benchmarked with vanilla YCSB.
> >
> > Backend          Avg time taken (sec)
> > MongoDB          ~90
> > HBase in Gora    ~160
> > HBase YCSB       ~160
> >
> > [1] https://github.com/brianfrankcooper/YCSB/tree/master/hbase098
> > [2] https://issues.apache.org/jira/browse/HBASE-4163
> > [3] https://gora.apache.org/current/gora-hbase.html
> >
> > Comments are welcome.
> >
> > Thank you.
> > *Sheriffo Ceesay*
> >
> > On Tue, Jun 11, 2019 at 12:04 AM Sheriffo Ceesay <sneceesa...@gmail.com> wrote:
> >>
> >> Hello Alfonso and Renato,
> >>
> >> Thank you for getting in touch and thanks for the detailed replies.
> >>
> >> I will have a proper look at this tomorrow morning. I did some troubleshooting yesterday (mostly playing with Xmx and ZooKeeper timeout settings); that improved the conditions, but it did not entirely solve the problem. Preliminarily, it seems the problem has to do with configuration or with how HBaseStore is implemented (this may not be entirely true).
> >>
> >> I will keep you all posted once I have had a thorough look at your suggestions.
> >>
> >> Thanks again.
> >>
> >> *Sheriffo Ceesay*
> >>
> >> On Mon, Jun 10, 2019 at 11:14 PM Alfonso Nishikawa <alfonso.nishik...@gmail.com> wrote:
> >>>
> >>> Hi!
> >>>
> >>> My hypothesis is that the difference between MongoDB and HBase is that HBase puts more stress on serializing with Avro. It could also matter that, if the HBase test is performed after the MongoDB ones, the GC starts from a "bad" situation.
> >>>
> >>> From [A] linked by @Renato, if the error were an OutOfMemoryException I would have recommended lowering gora.hbasestore.scanner.caching to 100, 10 or even 1, but with a GC error I am not so sure. In any case, @Sheriffo: you can try this if it still doesn't work with the optimizations :)
> >>>
> >>> @Renato: Thx for the links!
> >>>
> >>> Regards,
> >>>
> >>> Alfonso Nishikawa
> >>>
> >>> On Mon, Jun 10, 2019 at 22:02, Renato Marroquín Mogrovejo (<renatoj.marroq...@gmail.com>) wrote:
> >>>
> >>> > @Alfonso,
> >>> > Thank you very much for the suggestions! You are totally right about all of your points! Sheriffo, please benefit from them ;)
> >>> >
> >>> > Also, what is strange (although it can be optimized as Alfonso pointed out) is that it works for the MongoDB backend. So I would also suspect the configuration of the Gora-HBase client. Have you taken a look at [A] for example? Or other Gora-HBase assumed configurations [B]? Maybe there you can specify some Xmx / Xms config.
> >>> >
> >>> > Best,
> >>> >
> >>> > Renato M.
> >>> >
> >>> > [A] https://github.com/sneceesay77/gora/blob/master/gora-hbase/src/test/conf/gora.properties
> >>> > [B] https://github.com/sneceesay77/gora/blob/master/gora-hbase/src/test/conf/hbase-site.xml
> >>> >
> >>> > On Mon, Jun 10, 2019 at 23:39, Alfonso Nishikawa (<alfonso.nishik...@gmail.com>) wrote:
> >>> > >
> >>> > > Hi again, Sheriffo.
> >>> > >
> >>> > > More improvements to [1] over the last email:
> >>> > >
> >>> > > - fields.toArray() doesn't need a full array like in [6]. You should do just fields.toArray(new String[0]), and better if you create an array [0] and reuse it. That call only needs the type.
> >>> > > - I guess the class at [2] will always be the same, so you don't need to set it on every insert call.
> >>> > > - The string concatenation is overkill for the JVM on the 1M calls * N fields at [3] and same for [4]. Precalculate the names in a list or array and reuse them for the 1M*N calls.
> >>> > > - Another optimization for [3] is, given that PersistentBase [5] extends SpecificRecordBase, you can access the fields by index with SpecificRecordBase.get(int) and SpecificRecordBase.put(int, Object).
> >>> > >
> >>> > > [1] - https://github.com/sneceesay77/gora/blob/GORA-532/gora-benchmark/src/main/java/org/apache/gora/benchmark/GoraBenchmarkClient.java#L127
> >>> > > [2] - https://github.com/sneceesay77/gora/blob/GORA-532/gora-benchmark/src/main/java/org/apache/gora/benchmark/GoraBenchmarkClient.java#L134
> >>> > > [3] - https://github.com/sneceesay77/gora/blob/GORA-532/gora-benchmark/src/main/java/org/apache/gora/benchmark/GoraBenchmarkClient.java#L136
> >>> > > [4] - https://github.com/sneceesay77/gora/blob/GORA-532/gora-benchmark/src/main/java/org/apache/gora/benchmark/GoraBenchmarkClient.java#L139
> >>> > > [5] - https://github.com/sneceesay77/gora/blob/GORA-532/gora-core/src/main/java/org/apache/gora/persistency/impl/PersistentBase.java#L3
> >>> > > [6] - https://github.com/sneceesay77/gora/blob/GORA-532/gora-benchmark/src/main/java/org/apache/gora/benchmark/GoraBenchmarkClient.java#L163
> >>> > >
> >>> > > Let's see if with those optimizations we free the JVM memory management from much stress.
> >>> > >
> >>> > > Regards,
> >>> > >
> >>> > > Alfonso Nishikawa
> >>> > >
> >>> > > On Mon, Jun 10, 2019 at 21:18, Alfonso Nishikawa (<alfonso.nishik...@gmail.com>) wrote:
> >>> > >
> >>> > > > Hi, Sheriffo.
> >>> > > >
> >>> > > > You can try reusing the Persistent instances [1] to insert the data. I don't know all the backends, but they should be reusable, at least in MongoDB and HBase.
> >>> > > >
> >>> > > > [1] - https://github.com/sneceesay77/gora/blob/GORA-532/gora-benchmark/src/main/java/org/apache/gora/benchmark/GoraBenchmarkClient.java#L130
> >>> > > >
> >>> > > > Regards,
> >>> > > >
> >>> > > > Alfonso Nishikawa
> >>> > > >
> >>> > > > On Mon, Jun 10, 2019 at 21:14, Alfonso Nishikawa (<alfonso.nishik...@gmail.com>) wrote:
> >>> > > >
> >>> > > >> Hi, Sheriffo.
> >>> > > >>
> >>> > > >> I really don't know how to solve it, but are you setting any Xmx / Xms configuration values?
> >>> > > >>
> >>> > > >> Regards,
> >>> > > >>
> >>> > > >> Alfonso Nishikawa
> >>> > > >>
> >>> > > >> On Sat, Jun 8, 2019 at 16:02, Sheriffo Ceesay (<sneceesa...@gmail.com>) wrote:
> >>> > > >>
> >>> > > >>> Hi All,
> >>> > > >>>
> >>> > > >>> Week 2 progress update is available at
> >>> > > >>> https://cwiki.apache.org/confluence/display/GORA/%5BGORA-532%5D+Apache+Gora+Benchmark+Module+Weekly+Report
> >>> > > >>>
> >>> > > >>> I have one question that I would like my mentors to advise on. I am still working on it but thought it would be good to report it because it is HBase specific.
> >>> > > >>>
> >>> > > >>> So the problem has to do with an OutOfMemory error when inserting 1M+ records in HBase.
> >>> > > >>> This happens when I try to run the actual benchmark by first loading HBase with 1 million plus records. It works perfectly for MongoDB but not HBase.
> >>> > > >>>
> >>> > > >>> So I am assuming this problem is specific to HBase. The stack trace is given below.
> >>> > > >>>
> >>> > > >>> Exception in thread "Thread-1" java.lang.OutOfMemoryError: GC overhead limit exceeded
> >>> > > >>>     at java.lang.StringCoding$StringEncoder.encode(StringCoding.java:300)
> >>> > > >>>     at java.lang.StringCoding.encode(StringCoding.java:344)
> >>> > > >>>     at java.lang.String.getBytes(String.java:918)
> >>> > > >>>     at org.apache.hadoop.hbase.util.Bytes.toBytes(Bytes.java:733)
> >>> > > >>>     at org.apache.gora.hbase.util.HBaseByteInterface.toBytes(HBaseByteInterface.java:225)
> >>> > > >>>     at org.apache.gora.hbase.store.HBaseStore.addPutsAndDeletes(HBaseStore.java:383)
> >>> > > >>>     at org.apache.gora.hbase.store.HBaseStore.addPutsAndDeletes(HBaseStore.java:348)
> >>> > > >>>     at org.apache.gora.hbase.store.HBaseStore.put(HBaseStore.java:319)
> >>> > > >>>     at org.apache.gora.hbase.store.HBaseStore.put(HBaseStore.java:84)
> >>> > > >>>     at org.apache.gora.benchmark.GoraBenchmarkClient.insert(GoraBenchmarkClient.java:141)
> >>> > > >>>     at com.yahoo.ycsb.DBWrapper.insert(DBWrapper.java:148)
> >>> > > >>>     at com.yahoo.ycsb.workloads.CoreWorkload.doInsert(CoreWorkload.java:461)
> >>> > > >>>     at com.yahoo.ycsb.ClientThread.run(Client.java:269)
> >>> > > >>>
> >>> > > >>> The insert implementation of the module available at https://github.com/sneceesay77/gora/tree/GORA-532/gora-benchmark in GoraBenchmarkClient.java is very straightforward. I have had a brief look at the HBaseStore.java put() implementation but could not find an issue with that.
> >>> > > >>>
> >>> > > >>> If I solve this problem, then I will run more workloads to verify that the module is stable for the basic implementation. Then I will go ahead and work on the suggestions made by Renato last week.
> >>> > > >>>
> >>> > > >>> Please let me know what your thoughts are.
> >>> > > >>>
> >>> > > >>> Thank you.
> >>> > > >>>
> >>> > > >>> **Sheriffo Ceesay**
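
For the flushing question raised in this thread (flush every N inserts, rather than on every insert or only once at the end of the 1M records), a counter kept next to the store may be enough. The following is a minimal Java sketch under stated assumptions: `Store` is a hypothetical two-method interface mirroring the put and flush calls used on dataStore above, while `BatchingWriter` and `CountingStore` are illustrative names that do not exist in Gora or YCSB. It assumes one writer per client thread, so the counter is not synchronized.

```java
public class PeriodicFlushSketch {

    /** Hypothetical minimal view of a data store: only the two calls this sketch needs. */
    public interface Store<K, V> {
        void put(K key, V value);

        void flush();
    }

    /** Forwards every insert to the store and flushes after every flushEvery inserts. */
    public static final class BatchingWriter<K, V> {
        private final Store<K, V> store;
        private final int flushEvery;
        private int pending; // inserts since the last flush

        public BatchingWriter(Store<K, V> store, int flushEvery) {
            if (flushEvery <= 0) {
                throw new IllegalArgumentException("flushEvery must be positive");
            }
            this.store = store;
            this.flushEvery = flushEvery;
        }

        public void insert(K key, V value) {
            store.put(key, value);
            pending++;
            if (pending >= flushEvery) {
                store.flush();
                pending = 0;
            }
        }

        /** Flushes the final partial batch, if any; intended to be called once at shutdown. */
        public void close() {
            if (pending > 0) {
                store.flush();
                pending = 0;
            }
        }
    }

    /** In-memory store that only counts calls, used here for demonstration. */
    public static final class CountingStore implements Store<String, String> {
        public int puts;
        public int flushes;

        @Override
        public void put(String key, String value) {
            puts++;
        }

        @Override
        public void flush() {
            flushes++;
        }
    }

    public static void main(String[] args) {
        CountingStore store = new CountingStore();
        BatchingWriter<String, String> writer = new BatchingWriter<>(store, 1000);
        for (int i = 0; i < 2500; i++) {
            writer.insert("user" + i, "value");
        }
        writer.close();
        System.out.println(store.puts + " puts, " + store.flushes + " flushes");
    }
}
```

With 2500 inserts and a batch size of 1000, this prints "2500 puts, 3 flushes": two full batches plus the final partial batch flushed by close(). In a YCSB client, the call to close() would presumably belong in cleanup(), and the batch size is a tuning knob whose best value would have to be measured.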