Hi Haoyuan, thanks for replying.

2014-07-21 16:29 GMT+08:00 Haoyuan Li <haoyuan...@gmail.com>:

> Qingyang,
>
> Aha. Got it.
>
> 800MB data is pretty small. Loading from Tachyon does have a bit of extra
> overhead. But it will have more benefit when the data size is larger. Also,
> if you store the table in Tachyon, you can have different Shark servers
> query the data at the same time. For more trade-offs, please refer to this
> page: http://tachyon-project.org/Running-Shark-on-Tachyon.html
>
> Best,
>
> Haoyuan
>
>
> On Wed, Jul 16, 2014 at 12:06 AM, qingyang li <liqingyang1...@gmail.com>
> wrote:
>
> > Let me describe my scenario:
> > ----------------------
> > I have 8 machines (24 cores, 16G memory per machine) running a Spark
> > cluster and a Tachyon cluster. On Tachyon, I created one table which
> > contains 800M data. When I run a query on Shark, it takes 2.43s, but
> > when I create the same table in Spark memory and run the same SQL, it
> > takes 1.56s. Data on Tachyon costs more time than data in Spark memory.
> > Both cases have 150 map processes, with 16-20 map processes per node.
> > I think the reason is that when data is on Tachyon, Shark has each Spark
> > slave load data from the Tachyon slave on the same node.
> > I have tried some configuration settings to tune Shark and Tachyon, but
> > still cannot make the former faster than 2.43s.
> > Does anyone have any ideas?
> >
> > By the way, my Tachyon block size is 1GB now. I want to reset the block
> > size; will it work by setting tachyon.user.default.block.size.byte=8M?
> > If not, what does tachyon.user.default.block.size.byte mean?
> >
> >
> > 2014-07-14 13:13 GMT+08:00 qingyang li <liqingyang1...@gmail.com>:
> >
> > > Shark, thanks for replying.
> > > Let me clarify my question.
> > > ----------------------------------------------
> > > I create a table using "create table xxx1
> > > tblproperties("shark.cache"="tachyon") as select * from xxx2".
> > > When executing some SQL (for example, select * from xxx1) using Shark,
> > > Shark will read data into Shark's memory from Tachyon's memory.
> > > I think if Shark always loads data from Tachyon each time we execute
> > > SQL, it is less efficient.
> > > Could we use some cache policy (such as CacheAllPolicy, FIFOCachePolicy,
> > > LRUCachePolicy) to cache data and avoid reading data from Tachyon for
> > > each SQL query?
> > > ----------------------------------------------
> > >
> > >
> > > 2014-07-14 2:47 GMT+08:00 Haoyuan Li <haoyuan...@gmail.com>:
> > >
> > >> Qingyang,
> > >>
> > >> Are you asking Spark or Shark (The first email was "Shark", the last
> > >> email was "Spark".)?
> > >>
> > >> Best,
> > >>
> > >> Haoyuan
> > >>
> > >>
> > >> On Wed, Jul 9, 2014 at 7:40 PM, qingyang li <liqingyang1...@gmail.com>
> > >> wrote:
> > >>
> > >> > Could I set some cache policy to let Spark load data from Tachyon
> > >> > only one time for all SQL queries? For example, by using
> > >> > CacheAllPolicy, FIFOCachePolicy, LRUCachePolicy. But I have tried
> > >> > those three policies, and they are not useful.
> > >> > I think, if Spark always loads data for each SQL query, it will
> > >> > impact the query speed; it will take more time than the case where
> > >> > data is managed by Spark itself.
> > >> >
> > >> >
> > >> > 2014-07-09 1:19 GMT+08:00 Haoyuan Li <haoyuan...@gmail.com>:
> > >> >
> > >> > > Yes. For Shark, two modes, "shark.cache=tachyon" and
> > >> > > "shark.cache=memory", have the same ser/de overhead. Shark loads
> > >> > > data from outside of the process in Tachyon mode with the
> > >> > > following benefits:
> > >> > >
> > >> > >    - In-memory data sharing across multiple Shark instances (i.e.
> > >> > >    stronger isolation)
> > >> > >    - Instant recovery of in-memory tables
> > >> > >    - Reduce heap size => faster GC in Shark
> > >> > >    - If the table is larger than the memory size, only the hot
> > >> > >    columns will be cached in memory
> > >> > >
> > >> > > from http://tachyon-project.org/master/Running-Shark-on-Tachyon.html
> > >> > > and https://github.com/amplab/shark/wiki/Running-Shark-with-Tachyon
> > >> > >
> > >> > > Haoyuan
> > >> > >
> > >> > >
> > >> > > On Tue, Jul 8, 2014 at 9:58 AM, Aaron Davidson <ilike...@gmail.com>
> > >> > > wrote:
> > >> > >
> > >> > > > Shark's in-memory format is already serialized (it's compressed
> > >> > > > and column-based).
> > >> > > >
> > >> > > >
> > >> > > > On Tue, Jul 8, 2014 at 9:50 AM, Mridul Muralidharan
> > >> > > > <mri...@gmail.com> wrote:
> > >> > > >
> > >> > > > > You are ignoring serde costs :-)
> > >> > > > >
> > >> > > > > - Mridul
> > >> > > > >
> > >> > > > > On Tue, Jul 8, 2014 at 8:48 PM, Aaron Davidson
> > >> > > > > <ilike...@gmail.com> wrote:
> > >> > > > > > Tachyon should only be marginally less performant than
> > >> > > > > > memory_only, because we mmap the data from Tachyon's
> > >> > > > > > ramdisk. We do not have to, say, transfer the data over a
> > >> > > > > > pipe from Tachyon; we can directly read from the buffers in
> > >> > > > > > the same way that Shark reads from its in-memory columnar
> > >> > > > > > format.
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > On Tue, Jul 8, 2014 at 1:18 AM, qingyang li
> > >> > > > > > <liqingyang1...@gmail.com> wrote:
> > >> > > > > >
> > >> > > > > >> Hi, when I create a table, I can specify the cache
> > >> > > > > >> strategy using shark.cache.
> > >> > > > > >> I think "shark.cache=memory_only" means data is managed by
> > >> > > > > >> Spark and is in the same JVM as the executor, while
> > >> > > > > >> "shark.cache=tachyon" means data is managed by Tachyon,
> > >> > > > > >> which is off heap, and is not in the same JVM as the
> > >> > > > > >> executor, so Spark will load data from Tachyon for each
> > >> > > > > >> SQL query. So, is Tachyon less efficient than the
> > >> > > > > >> memory_only cache strategy?
> > >> > > > > >> If yes, can we let Spark load all data once from Tachyon
> > >> > > > > >> for all SQL queries if I want to use the Tachyon cache
> > >> > > > > >> strategy, since Tachyon is more HA than memory_only?
> > >> > > > > >>
> > >> > >
> > >> > >
> > >> > > --
> > >> > > Haoyuan Li
> > >> > > AMPLab, EECS, UC Berkeley
> > >> > > http://www.cs.berkeley.edu/~haoyuan/
> > >> >
> > >>
> > >>
> > >> --
> > >> Haoyuan Li
> > >> AMPLab, EECS, UC Berkeley
> > >> http://www.cs.berkeley.edu/~haoyuan/
> >
>
>
> --
> Haoyuan Li
> AMPLab, EECS, UC Berkeley
> http://www.cs.berkeley.edu/~haoyuan/
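
For the block-size question in the thread, here is a rough sketch of the arithmetic only. It assumes a stored table is simply cut into fixed-size blocks with a possibly partial last block, and that "800M" means 800 MB. `block_count` is a hypothetical helper, not a Tachyon or Shark API, and nothing in this thread establishes how block count relates to the 150 map processes or to query time.

```python
MB = 1024 * 1024
GB = 1024 * MB


def block_count(data_bytes: int, block_bytes: int) -> int:
    """Return how many fixed-size blocks are needed to hold data_bytes.

    Assumption (not from the thread): data is split into equal blocks and
    the last block may be only partly filled.
    """
    if data_bytes < 0:
        raise ValueError("data size must be non-negative")
    if block_bytes <= 0:
        raise ValueError("block size must be positive")
    # Integer ceiling division avoids floating-point rounding.
    return (data_bytes + block_bytes - 1) // block_bytes


if __name__ == "__main__":
    table_bytes = 800 * MB  # the table size mentioned in the thread
    for label, size in (("1GB", 1 * GB), ("8MB", 8 * MB)):
        print(f"block size {label}: {block_count(table_bytes, size)} block(s)")
```

Under these assumptions, a 1GB block size keeps the whole 800MB table in a single block, while an 8MB block size splits it into many small blocks.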