Do you also need a separate Wiki? :-) If not, please feel free to describe your ideas on the Wiki, divided into short-term and long-term plans.

On Tue, May 7, 2013 at 4:35 AM, Edward J. Yoon <[email protected]> wrote:
> 1. Graph/Matrix data is small but Graph/Matrix algo requires huge
> computations. Hence, the number of BSP processors should be able to
> adjust ( != file blocks).
>
> 2. I'm -1 for using local disk to store partitions. HDFS is high cost.
> But, reuse of partitions should be considered.
>
> On Tue, May 7, 2013 at 2:08 AM, Tommaso Teofili
> <[email protected]> wrote:
>> 2013/5/6 Suraj Menon <[email protected]>
>>
>>> I am assuming that the storage of vertices (NoSQL or any other format) need
>>> not be updated after every iteration.
>>>
>>> Based on the above assumption, I have the following suggestions:
>>>
>>> - Instead of running a separate job, we inject a partitioning superstep
>>> before the first superstep of the job. (This has a dependency on the
>>> Superstep API)
>>>
>>
>> could we do that without introducing that dependency? I mean would that
>> work also if not using the Superstep API on the client side?
>>
>>> - The partitions instead of being written to HDFS, which is creating a copy
>>> of input files in HDFS Cluster (too costly I believe), should be written to
>>> local files and read from.
>>>
>>
>> +1
>>
>>> - For graph jobs, we can configure this partitioning superstep class
>>> specific to graph partitioning class that partitions and loads vertices.
>>>
>>
>> this seems to be inline with the above assumption thus it probably makes
>> sense.
>>
>>>
>>> This sure has some dependencies. But would be a graceful solution and can
>>> tackle every problem. This is what I want to achieve in the end. Please
>>> proceed if you have any intermediate ways to reach here faster.
>>>
>>
>> Your solution sounds good to me generally, better if we can avoid the
>> dependency, but still ok if not.
>> Let's collect also others' opinions and try to reach a shared consensus.
>>
>> Tommaso
>>
>>>
>>> Regards,
>>> Suraj
>>>
>>> On Mon, May 6, 2013 at 3:14 AM, Edward J. Yoon <[email protected]> wrote:
>>>
>>> > P.S., BSPJob (with table input) also the same. It's not only for GraphJob.
>>> >
>>> > On Mon, May 6, 2013 at 4:09 PM, Edward J. Yoon <[email protected]> wrote:
>>> > > All,
>>> > >
>>> > > I've also roughly described details about design of Graph APIs[1]. To
>>> > > reduce our misunderstandings (please read first Partitioning and
>>> > > GraphModuleInternals documents),
>>> > >
>>> > > * In NoSQLs case, there's obviously no need to Hash-partitioning or
>>> > > rewrite partition files on HDFS. So, in these input cases, I think
>>> > > vertex structure should be parsed at GraphJobRunner.loadVertices()
>>> > > method.
>>> > >
>>> > > Here, we face two options: 1) The current implementation of
>>> > > 'PartitioningRunner' writes converted vertices on sequence format
>>> > > partition files. And GraphJobRunner reads only Vertex Writable
>>> > > objects. If input is table, we maybe have to skip the Partitioning job
>>> > > and have to parse vertex structure at loadVertices() method after
>>> > > checking some conditions. 2) PartitioningRunner just writes raw
>>> > > records to proper partition files after checking its partition ID. And
>>> > > GraphJobRunner.loadVertices() always parses and loads vertices.
>>> > >
>>> > > I meant that I prefer the latter and there's no need to write
>>> > > VertexWritable files. It's not related whether graph will support only
>>> > > Seq format or not. Hope my explanation is enough!
>>> > >
>>> > > 1. http://wiki.apache.org/hama/GraphModuleInternals
>>> > >
>>> > > On Mon, May 6, 2013 at 10:00 AM, Edward J. Yoon <[email protected]> wrote:
>>> > >> I've described my big picture here:
>>> > >> http://wiki.apache.org/hama/Partitioning
>>> > >>
>>> > >> Please review and feedback whether this is acceptable.
>>> > >>
>>> > >> On Mon, May 6, 2013 at 8:18 AM, Edward <[email protected]> wrote:
>>> > >>> p.s., I think there's a misunderstanding. It doesn't mean that graph will
>>> > >>> support only sequence file format. The main point is whether converting
>>> > >>> happens at the partitioning stage or the loadVertices stage.
>>> > >>>
>>> > >>> Sent from my iPhone
>>> > >>>
>>> > >>> On May 6, 2013, at 8:09 AM, Suraj Menon <[email protected]> wrote:
>>> > >>>
>>> > >>>> Sure, Please go ahead.
>>> > >>>>
>>> > >>>> On Sun, May 5, 2013 at 6:52 PM, Edward J. Yoon <[email protected]> wrote:
>>> > >>>>
>>> > >>>>>>> Please let me know before this is changed, I would like to work on a
>>> > >>>>>>> separate branch.
>>> > >>>>>
>>> > >>>>> I personally think we have to focus on high priority tasks, and on more
>>> > >>>>> feedback and contributions from users. So, if changes are made, I'll
>>> > >>>>> release periodically. If you want to work in another place, please do.
>>> > >>>>> I don't want to wait for your patches.
>>> > >>>>>
>>> > >>>>> On Mon, May 6, 2013 at 7:49 AM, Edward J. Yoon <[email protected]> wrote:
>>> > >>>>>> For preparing integration with NoSQLs, of course, maybe condition
>>> > >>>>>> check (whether converted or not) can be used without removing record
>>> > >>>>>> converter.
>>> > >>>>>>
>>> > >>>>>> We need to discuss everything.
>>> > >>>>>>
>>> > >>>>>> On Mon, May 6, 2013 at 7:11 AM, Suraj Menon <[email protected]> wrote:
>>> > >>>>>>> I am still -1 if this means our graph module can work only on sequential
>>> > >>>>>>> file format.
>>> > >>>>>>> Please note that you can set record converter to null and make changes to
>>> > >>>>>>> loadVertices for what you desire here.
>>> > >>>>>>>
>>> > >>>>>>> If we came to this design, because TextInputFormat is inefficient, would
>>> > >>>>>>> this work for Avro or Thrift input format?
>>> > >>>>>>> Please let me know before this is changed, I would like to work on a
>>> > >>>>>>> separate branch.
>>> > >>>>>>> You may proceed as you wish.
>>> > >>>>>>>
>>> > >>>>>>> Regards,
>>> > >>>>>>> Suraj
>>> > >>>>>>>
>>> > >>>>>>> On Sun, May 5, 2013 at 4:09 PM, Edward J. Yoon <[email protected]> wrote:
>>> > >>>>>>>
>>> > >>>>>>>> I think 'record converter' should be removed. It's not good idea.
>>> > >>>>>>>> Moreover, it's unnecessarily complex. To keep vertex input reader, we
>>> > >>>>>>>> can move related classes into common module.
>>> > >>>>>>>>
>>> > >>>>>>>> Let's go with my original plan.
>>> > >>>>>>>>
>>> > >>>>>>>> On Sat, May 4, 2013 at 9:32 AM, Edward J. Yoon <[email protected]> wrote:
>>> > >>>>>>>>> Hi all,
>>> > >>>>>>>>>
>>> > >>>>>>>>> I'm reading our old discussions about record converter, superstep
>>> > >>>>>>>>> injection, and common module:
>>> > >>>>>>>>>
>>> > >>>>>>>>> - http://markmail.org/message/ol32pp2ixfazcxfc
>>> > >>>>>>>>> - http://markmail.org/message/xwtmfdrag34g5xc4
>>> > >>>>>>>>>
>>> > >>>>>>>>> To clarify goals and objectives:
>>> > >>>>>>>>>
>>> > >>>>>>>>> 1. A parallel input partition is necessary for obtaining scalability
>>> > >>>>>>>>> and elasticity of a Bulk Synchronous Parallel processing (It's not a
>>> > >>>>>>>>> memory issue, or Disk/Spilling Queue, or HAMA-644. Please don't
>>> > >>>>>>>>> shake).
>>> > >>>>>>>>> 2. Input partitioning should be handled at BSP framework level, and it
>>> > >>>>>>>>> is for every Hama jobs, not only for Graph jobs.
>>> > >>>>>>>>> 3. Unnecessary I/O Overhead need to be avoided, and NoSQLs input also
>>> > >>>>>>>>> should be considered.
>>> > >>>>>>>>>
>>> > >>>>>>>>> The current problem is that every input of graph jobs should be
>>> > >>>>>>>>> rewritten on HDFS. If you have a good idea, Please let me know.

--
Best Regards, Edward J. Yoon
@eddieyoon
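
For readers following the thread, option 2 above (the partitioner only routes raw records by partition ID, and parsing happens at load time) can be sketched roughly as below. This is a minimal illustration in Python, not Hama code: the function names, the tab-separated "vertex, neighbour, neighbour" record layout, and the CRC32 hash are all assumptions made for the example. The number of partitions is a plain parameter, reflecting the point that it should not be tied to the number of file blocks.

```python
import zlib
from collections import defaultdict


def partition_id(vertex_id, num_partitions):
    """Stable hash partitioner (hypothetical): the same vertex ID always
    maps to the same partition, independent of how the input was split."""
    return zlib.crc32(vertex_id.encode("utf-8")) % num_partitions


def partition_raw_records(records, num_partitions):
    """Partitioning step: route each raw record to its partition without
    converting it into a vertex object. Only the leading ID is inspected."""
    partitions = defaultdict(list)
    for line in records:
        vertex_id = line.split("\t", 1)[0]
        partitions[partition_id(vertex_id, num_partitions)].append(line)
    return partitions


def load_vertices(raw_partition):
    """Load step: parse raw records into {vertex_id: [neighbour ids]}.
    Parsing always happens here, whatever the input source was."""
    vertices = {}
    for line in raw_partition:
        vertex_id, *neighbours = line.rstrip("\n").split("\t")
        vertices[vertex_id] = neighbours
    return vertices


if __name__ == "__main__":
    records = ["a\tb\tc", "b\ta", "c\ta"]
    for pid, raw in sorted(partition_raw_records(records, 2).items()):
        print(pid, load_vertices(raw))
```

Under this split, an input that is already partitioned (for example a table in a NoSQL store) could skip the routing step and feed its records straight to the load step, which is the case the thread is concerned with.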
