Hello all, a GSoC student who wants to try integrating NoSQLs with Graph is looking at this thread. My suggestion is not a quick-fix solution; it's a must. Please let me know whether or not you understand my suggestion.
On Tue, May 7, 2013 at 9:38 AM, Edward J. Yoon <[email protected]> wrote:
> Do you need also a separated Wiki? :-) If not, please feel free to
> describe your ideas on Wiki, dividing short-term/long-term plans.
>
> On Tue, May 7, 2013 at 4:35 AM, Edward J. Yoon <[email protected]> wrote:
>> 1. Graph/Matrix data is small but Graph/Matrix algo requires huge
>> computations. Hence, the number of BSP processors should be able to
>> adjust ( != file blocks).
>>
>> 2. I'm -1 for using local disk to store partitions. HDFS is high cost.
>> But, reuse of partitions should be considered.
>>
>> On Tue, May 7, 2013 at 2:08 AM, Tommaso Teofili
>> <[email protected]> wrote:
>>> 2013/5/6 Suraj Menon <[email protected]>
>>>
>>>> I am assuming that the storage of vertices (NoSQL or any other format)
>>>> need not be updated after every iteration.
>>>>
>>>> Based on the above assumption, I have the following suggestions:
>>>>
>>>> - Instead of running a separate job, we inject a partitioning superstep
>>>> before the first superstep of the job. (This has a dependency on the
>>>> Superstep API)
>>>
>>> could we do that without introducing that dependency? I mean would that
>>> work also if not using the Superstep API on the client side?
>>>
>>>> - The partitions, instead of being written to HDFS, which is creating a
>>>> copy of input files in the HDFS cluster (too costly I believe), should
>>>> be written to local files and read from there.
>>>
>>> +1
>>>
>>>> - For graph jobs, we can configure this partitioning superstep class
>>>> specific to a graph partitioning class that partitions and loads vertices.
>>>
>>> this seems to be in line with the above assumption thus it probably makes
>>> sense.
>>>
>>>> This sure has some dependencies. But it would be a graceful solution and
>>>> can tackle every problem. This is what I want to achieve in the end.
>>>> Please proceed if you have any intermediate ways to reach here faster.
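[Editorial sketch: the injected partitioning superstep proposed above could look roughly like the following. All names here (`getPartitionId`, `partition`, the tab-separated record layout) are illustrative assumptions, not Hama's actual Superstep API.]

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: before the first "real" superstep, each peer scans its input
// split, computes a partition ID per raw record, and routes the record to
// the peer that owns that partition. No separate partitioning job is run.
public class PartitioningSketch {

    // Hash partitioning by vertex ID, as the partitioning superstep would do.
    static int getPartitionId(String vertexId, int numPeers) {
        return Math.abs(vertexId.hashCode() % numPeers);
    }

    // Group raw input records by the peer that should own them; in a real
    // BSP job these groups would be sent as messages, not kept in a map.
    static Map<Integer, List<String>> partition(List<String> rawRecords, int numPeers) {
        Map<Integer, List<String>> byPeer = new HashMap<>();
        for (String record : rawRecords) {
            // assume the vertex ID is the first tab-separated field
            String vertexId = record.split("\t", 2)[0];
            int pid = getPartitionId(vertexId, numPeers);
            byPeer.computeIfAbsent(pid, k -> new ArrayList<>()).add(record);
        }
        return byPeer;
    }

    public static void main(String[] args) {
        List<String> input = List.of("a\tb c", "b\tc", "c\ta");
        Map<Integer, List<String>> parts = partition(input, 2);
        int total = parts.values().stream().mapToInt(List::size).sum();
        System.out.println("records routed: " + total); // prints "records routed: 3"
    }
}
```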
>>>
>>> Your solution sounds good to me generally, better if we can avoid the
>>> dependency, but still ok if not.
>>> Let's collect also others' opinions and try to reach a shared consensus.
>>>
>>> Tommaso
>>>
>>>> Regards,
>>>> Suraj
>>>>
>>>> On Mon, May 6, 2013 at 3:14 AM, Edward J. Yoon <[email protected]> wrote:
>>>> > P.S., BSPJob (with table input) is also the same. It's not only for GraphJob.
>>>> >
>>>> > On Mon, May 6, 2013 at 4:09 PM, Edward J. Yoon <[email protected]> wrote:
>>>> > > All,
>>>> > >
>>>> > > I've also roughly described details about the design of the Graph APIs[1].
>>>> > > To reduce our misunderstandings (please read first the Partitioning and
>>>> > > GraphModuleInternals documents),
>>>> > >
>>>> > > * In the NoSQLs case, there's obviously no need to hash-partition or
>>>> > > rewrite partition files on HDFS. So, for these inputs, I think the
>>>> > > vertex structure should be parsed in the GraphJobRunner.loadVertices()
>>>> > > method.
>>>> > >
>>>> > > Here, we face two options: 1) The current implementation of
>>>> > > 'PartitioningRunner' writes converted vertices to sequence-format
>>>> > > partition files, and GraphJobRunner reads only VertexWritable
>>>> > > objects. If the input is a table, we maybe have to skip the partitioning
>>>> > > job and parse the vertex structure in the loadVertices() method after
>>>> > > checking some conditions. 2) PartitioningRunner just writes raw
>>>> > > records to the proper partition files after checking their partition ID,
>>>> > > and GraphJobRunner.loadVertices() always parses and loads vertices.
>>>> > >
>>>> > > I meant that I prefer the latter and that there's no need to write
>>>> > > VertexWritable files. It's not related to whether graph will support
>>>> > > only Seq format or not. Hope my explanation is enough!
>>>> > >
>>>> > > 1. http://wiki.apache.org/hama/GraphModuleInternals
>>>> > >
>>>> > > On Mon, May 6, 2013 at 10:00 AM, Edward J. Yoon <[email protected]> wrote:
>>>> > >> I've described my big picture here: http://wiki.apache.org/hama/Partitioning
>>>> > >>
>>>> > >> Please review and give feedback on whether this is acceptable.
>>>> > >>
>>>> > >> On Mon, May 6, 2013 at 8:18 AM, Edward <[email protected]> wrote:
>>>> > >>> P.S., I think there's a misunderstanding. It doesn't mean that graph
>>>> > >>> will support only the sequence file format. The main question is whether
>>>> > >>> to convert at the partitioning stage or the loadVertices stage.
>>>> > >>>
>>>> > >>> Sent from my iPhone
>>>> > >>>
>>>> > >>> On May 6, 2013, at 8:09 AM, Suraj Menon <[email protected]> wrote:
>>>> > >>>
>>>> > >>>> Sure, please go ahead.
>>>> > >>>>
>>>> > >>>> On Sun, May 5, 2013 at 6:52 PM, Edward J. Yoon <[email protected]> wrote:
>>>> > >>>>>>> Please let me know before this is changed, I would like to work on a
>>>> > >>>>>>> separate branch.
>>>> > >>>>>
>>>> > >>>>> Personally, I think we have to focus on high-priority tasks, and on
>>>> > >>>>> more feedback and contributions from users. So, if changes are made,
>>>> > >>>>> I'll release periodically. If you want to work in another place,
>>>> > >>>>> please do. I don't want to wait for your patches.
>>>> > >>>>>
>>>> > >>>>> On Mon, May 6, 2013 at 7:49 AM, Edward J. Yoon <[email protected]> wrote:
>>>> > >>>>>> In preparing the integration with NoSQLs, of course, maybe a condition
>>>> > >>>>>> check (whether converted or not) can be used without removing the
>>>> > >>>>>> record converter.
>>>> > >>>>>>
>>>> > >>>>>> We need to discuss everything.
>>>> > >>>>>>
>>>> > >>>>>> On Mon, May 6, 2013 at 7:11 AM, Suraj Menon <[email protected]> wrote:
>>>> > >>>>>>> I am still -1 if this means our graph module can work only on the
>>>> > >>>>>>> sequential file format.
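[Editorial sketch: "option 2" above, where the partitioning step writes raw, unparsed records into per-partition files and loadVertices() parses them later, could look roughly like this. The file layout (`part-N`), tab-separated record format, and method names are illustrative assumptions, not Hama's actual implementation.]

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RawRecordPartitioner {

    static int partitionId(String vertexId, int numPartitions) {
        return Math.abs(vertexId.hashCode() % numPartitions);
    }

    // Option 2: write each *raw* record, unparsed, into its partition's file.
    static void writePartitions(List<String> records, int numPartitions, Path dir) {
        try {
            Files.createDirectories(dir);
            for (String record : records) {
                String vertexId = record.split("\t", 2)[0];
                Path file = dir.resolve("part-" + partitionId(vertexId, numPartitions));
                Files.writeString(file, record + "\n",
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // What loadVertices() would then do per partition file: parse raw lines
    // into vertexId -> outgoing edges; no VertexWritable files are needed.
    static Map<String, List<String>> loadVertices(Path file) {
        try {
            Map<String, List<String>> vertices = new LinkedHashMap<>();
            for (String line : Files.readAllLines(file)) {
                String[] f = line.split("\t", 2);
                List<String> edges = f.length > 1 ? Arrays.asList(f[1].split(" ")) : List.of();
                vertices.put(f[0], edges);
            }
            return vertices;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("hama-parts");
        writePartitions(List.of("a\tb c", "b\tc", "c\ta"), 2, dir);
        int loaded = 0;
        for (int p = 0; p < 2; p++) {
            Path f = dir.resolve("part-" + p);
            if (Files.exists(f)) loaded += loadVertices(f).size();
        }
        System.out.println("vertices loaded: " + loaded); // prints "vertices loaded: 3"
    }
}
```

Writing raw records avoids serializing VertexWritable objects twice, at the cost of re-parsing on every load; that trade-off is exactly what the two options differ on.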
>>>> > >>>>>>> Please note that you can set the record converter to null and make
>>>> > >>>>>>> changes to loadVertices for what you desire here.
>>>> > >>>>>>>
>>>> > >>>>>>> If we came to this design because TextInputFormat is inefficient,
>>>> > >>>>>>> would this work for the Avro or Thrift input formats?
>>>> > >>>>>>> Please let me know before this is changed, I would like to work on a
>>>> > >>>>>>> separate branch.
>>>> > >>>>>>> You may proceed as you wish.
>>>> > >>>>>>>
>>>> > >>>>>>> Regards,
>>>> > >>>>>>> Suraj
>>>> > >>>>>>>
>>>> > >>>>>>> On Sun, May 5, 2013 at 4:09 PM, Edward J. Yoon <[email protected]> wrote:
>>>> > >>>>>>>> I think the 'record converter' should be removed. It's not a good
>>>> > >>>>>>>> idea. Moreover, it's unnecessarily complex. To keep the vertex input
>>>> > >>>>>>>> reader, we can move the related classes into the common module.
>>>> > >>>>>>>>
>>>> > >>>>>>>> Let's go with my original plan.
>>>> > >>>>>>>>
>>>> > >>>>>>>> On Sat, May 4, 2013 at 9:32 AM, Edward J. Yoon <[email protected]> wrote:
>>>> > >>>>>>>>> Hi all,
>>>> > >>>>>>>>>
>>>> > >>>>>>>>> I'm reading our old discussions about the record converter,
>>>> > >>>>>>>>> superstep injection, and the common module:
>>>> > >>>>>>>>>
>>>> > >>>>>>>>> - http://markmail.org/message/ol32pp2ixfazcxfc
>>>> > >>>>>>>>> - http://markmail.org/message/xwtmfdrag34g5xc4
>>>> > >>>>>>>>>
>>>> > >>>>>>>>> To clarify goals and objectives:
>>>> > >>>>>>>>>
>>>> > >>>>>>>>> 1. Parallel input partitioning is necessary for obtaining the
>>>> > >>>>>>>>> scalability and elasticity of Bulk Synchronous Parallel processing
>>>> > >>>>>>>>> (it's not a memory issue, or Disk/Spilling Queue, or HAMA-644.
>>>> > >>>>>>>>> Please don't shake).
>>>> > >>>>>>>>> 2. Input partitioning should be handled at the BSP framework level,
>>>> > >>>>>>>>> and it is for every Hama job, not only for Graph jobs.
>>>> > >>>>>>>>> 3. Unnecessary I/O overhead needs to be avoided, and NoSQLs input
>>>> > >>>>>>>>> also should be considered.
>>>> > >>>>>>>>>
>>>> > >>>>>>>>> The current problem is that every input of graph jobs should be
>>>> > >>>>>>>>> rewritten on HDFS. If you have a good idea, please let me know.
>>>> > >>>>>>>>>
>>>> > >>>>>>>>> --
>>>> > >>>>>>>>> Best Regards, Edward J. Yoon
>>>> > >>>>>>>>> @eddieyoon

--
Best Regards, Edward J. Yoon
@eddieyoon
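[Editorial sketch: the condition check discussed in this thread, skipping the pre-partitioning phase entirely when the input source can already be read partition-wise (e.g. a NoSQL table with its own key ranges), could be as simple as the following. The enum and the rule are illustrative assumptions, not Hama configuration keys.]

```java
// Decide whether a job needs the partitioning phase at all. NoSQL tables
// are range-partitioned by the store itself, so rewriting them into
// partition files on HDFS would be pure I/O overhead.
public class PartitioningPolicy {

    enum InputSource { SEQUENCE_FILE, TEXT_FILE, NOSQL_TABLE }

    static boolean needsPartitioningPhase(InputSource source) {
        return source != InputSource.NOSQL_TABLE;
    }

    public static void main(String[] args) {
        for (InputSource s : InputSource.values()) {
            System.out.println(s + " -> partitioning phase: " + needsPartitioningPhase(s));
        }
    }
}
```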
