Re: Issues about Partitioning and Record converter

Edward J. Yoon Wed, 08 May 2013 02:35:54 -0700

I think this is an important step forward, but let's close this
discussion by lazy consensus if no one objects within the next three
days.




On Tue, May 7, 2013 at 11:26 AM, Edward J. Yoon <[email protected]> wrote:
> I've noted Suraj's suggestion and added my opinions, too -
> http://wiki.apache.org/hama/Partitioning
>
> In this thread, please focus on the problem of integration with
> NoSQLs. Since PartitioningRunner converts the input records, and
> GraphJobRunner reads the converted records from partition files, table
> input must pass through PartitioningRunner unnecessarily. That's the
> problem with the current "Partitioning and Record converter" design.
>
>
> On Tue, May 7, 2013 at 9:50 AM, Edward J. Yoon <[email protected]> wrote:
>> Also, using the Superstep API is an improvement to, or one approach
>> for, partition processing. So the main question is whether we parse
>> vertices in the BSP core or in the graph job runner. Please don't get
>> sidetracked.
>>
>> On Tue, May 7, 2013 at 9:45 AM, Edward J. Yoon <[email protected]> wrote:
>>> Hello all,
>>>
>>> A GSoC student who wants to try integrating NoSQLs with Graph is
>>> looking at this thread. My suggestion is not a quick-fix solution;
>>> it's a must. Please let me know whether you understand my suggestion.
>>>
>>> On Tue, May 7, 2013 at 9:38 AM, Edward J. Yoon <[email protected]> wrote:
>>>> Do you also need a separate wiki page? :-) If not, please feel free
>>>> to describe your ideas on the wiki, divided into short-term and
>>>> long-term plans.
>>>>
>>>> On Tue, May 7, 2013 at 4:35 AM, Edward J. Yoon <[email protected]> wrote:
>>>>> 1. Graph/matrix data is small, but graph/matrix algorithms require
>>>>> huge computations. Hence, the number of BSP processors should be
>>>>> adjustable (it need not equal the number of file blocks).
>>>>>
>>>>> 2. I'm -1 on using local disk to store partitions. HDFS is costly,
>>>>> but reuse of partitions should be considered.
>>>>>
>>>>> On Tue, May 7, 2013 at 2:08 AM, Tommaso Teofili
>>>>> <[email protected]> wrote:
>>>>>> 2013/5/6 Suraj Menon <[email protected]>
>>>>>>
>>>>>>> I am assuming that the storage of vertices (NoSQL or any other
>>>>>>> format) need not be updated after every iteration.
>>>>>>>
>>>>>>> Based on the above assumption, I have the following suggestions:
>>>>>>>
>>>>>>> - Instead of running a separate job, we inject a partitioning superstep
>>>>>>> before the first superstep of the job. (This has a dependency on the
>>>>>>> Superstep API)
>>>>>>>
>>>>>>
>>>>>> could we do that without introducing that dependency? I mean would that
>>>>>> work also if not using the Superstep API on the client side?
>>>>>>
>>>>>>
>>>>>>> - The partitions, instead of being written to HDFS (which creates a
>>>>>>> copy of the input files in the HDFS cluster, too costly I believe),
>>>>>>> should be written to and read from local files.
>>>>>>>
>>>>>>
>>>>>> +1
>>>>>>
>>>>>>
>>>>>>> - For graph jobs, we can configure this partitioning superstep with a
>>>>>>> graph-specific partitioning class that partitions and loads vertices.
>>>>>>>
>>>>>>
>>>>>> This seems to be in line with the above assumption, so it probably
>>>>>> makes sense.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> This certainly has some dependencies, but it would be a graceful
>>>>>>> solution that can tackle every problem. This is what I want to
>>>>>>> achieve in the end. Please proceed if you have any intermediate ways
>>>>>>> to get there faster.
>>>>>>>
>>>>>>
>>>>>> Your solution sounds good to me generally, better if we can avoid the
>>>>>> dependency, but still ok if not.
>>>>>> Let's collect also others' opinions and try to reach a shared consensus.
>>>>>>
>>>>>> Tommaso
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>> Suraj
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, May 6, 2013 at 3:14 AM, Edward J. Yoon <[email protected]> wrote:
>>>>>>>
>>>>>>> > P.S. The same applies to BSPJob (with table input). It's not only for
>>>>>>> > GraphJob.
>>>>>>> >
>>>>>>> > On Mon, May 6, 2013 at 4:09 PM, Edward J. Yoon <[email protected]>
>>>>>>> > wrote:
>>>>>>> > > All,
>>>>>>> > >
>>>>>>> > > I've also roughly described the design of the Graph APIs [1]. To
>>>>>>> > > reduce misunderstandings (please read the Partitioning and
>>>>>>> > > GraphModuleInternals documents first):
>>>>>>> > >
>>>>>>> > >  * In the NoSQL case, there's obviously no need to hash-partition or
>>>>>>> > > rewrite partition files on HDFS. So, for these inputs, I think the
>>>>>>> > > vertex structure should be parsed in the
>>>>>>> > > GraphJobRunner.loadVertices() method.
>>>>>>> > >
>>>>>>> > > Here we face two options: 1) The current implementation of
>>>>>>> > > 'PartitioningRunner' writes converted vertices to sequence-format
>>>>>>> > > partition files, and GraphJobRunner reads only Vertex Writable
>>>>>>> > > objects. If the input is a table, we may have to skip the
>>>>>>> > > partitioning job and parse the vertex structure in the
>>>>>>> > > loadVertices() method after checking some conditions.
>>>>>>> > > 2) PartitioningRunner just writes raw records to the proper
>>>>>>> > > partition files after checking their partition ID, and
>>>>>>> > > GraphJobRunner.loadVertices() always parses and loads vertices.
>>>>>>> > >
>>>>>>> > > I meant that I prefer the latter, and that there's no need to write
>>>>>>> > > VertexWritable files. This is unrelated to whether graph will
>>>>>>> > > support only the Seq format. I hope my explanation is enough!
>>>>>>> > >
>>>>>>> > > 1. http://wiki.apache.org/hama/GraphModuleInternals
>>>>>>> > >
>>>>>>> > > On Mon, May 6, 2013 at 10:00 AM, Edward J. Yoon <[email protected]> wrote:
>>>>>>> > >> I've described my big picture here:
>>>>>>> > >> http://wiki.apache.org/hama/Partitioning
>>>>>>> > >>
>>>>>>> > >> Please review it and give feedback on whether this is acceptable.
>>>>>>> > >>
>>>>>>> > >>
>>>>>>> > >> On Mon, May 6, 2013 at 8:18 AM, Edward <[email protected]> wrote:
>>>>>>> > >>> P.S. I think there's a misunderstanding. It doesn't mean that graph
>>>>>>> > >>> will support only the sequence file format. The main question is
>>>>>>> > >>> whether to convert at the partitioning stage or the loadVertices
>>>>>>> > >>> stage.
>>>>>>> > >>>
>>>>>>> > >>> Sent from my iPhone
>>>>>>> > >>>
>>>>>>> > >>> On May 6, 2013, at 8:09 AM, Suraj Menon <[email protected]> wrote:
>>>>>>> > >>>
>>>>>>> > >>>> Sure, Please go ahead.
>>>>>>> > >>>>
>>>>>>> > >>>>
>>>>>>> > >>>> On Sun, May 5, 2013 at 6:52 PM, Edward J. Yoon <[email protected]> wrote:
>>>>>>> > >>>>
>>>>>>> > >>>>>>> Please let me know before this is changed, I would like to
>>>>>>> > >>>>>>> work on a separate branch.
>>>>>>> > >>>>>
>>>>>>> > >>>>> Personally, I think we have to focus on high-priority tasks and on
>>>>>>> > >>>>> getting more feedback and contributions from users. So, if changes
>>>>>>> > >>>>> are made, I'll release periodically. If you want to work somewhere
>>>>>>> > >>>>> else, please do. I don't want to wait for your patches.
>>>>>>> > >>>>>
>>>>>>> > >>>>>
>>>>>>> > >>>>> On Mon, May 6, 2013 at 7:49 AM, Edward J. Yoon <[email protected]> wrote:
>>>>>>> > >>>>>> To prepare for integration with NoSQLs, of course, maybe a
>>>>>>> > >>>>>> condition check (whether the records are converted or not) can be
>>>>>>> > >>>>>> used without removing the record converter.
>>>>>>> > >>>>>>
>>>>>>> > >>>>>> We need to discuss everything.
>>>>>>> > >>>>>>
>>>>>>> > >>>>>> On Mon, May 6, 2013 at 7:11 AM, Suraj Menon <[email protected]> wrote:
>>>>>>> > >>>>>>> I am still -1 if this means our graph module can work only on
>>>>>>> > >>>>>>> the sequence file format.
>>>>>>> > >>>>>>> Please note that you can set the record converter to null and
>>>>>>> > >>>>>>> make changes to loadVertices for what you desire here.
>>>>>>> > >>>>>>>
>>>>>>> > >>>>>>> If we came to this design because TextInputFormat is
>>>>>>> > >>>>>>> inefficient, would this work for Avro or Thrift input formats?
>>>>>>> > >>>>>>> Please let me know before this is changed; I would like to
>>>>>>> > >>>>>>> work on a separate branch.
>>>>>>> > >>>>>>> You may proceed as you wish.
>>>>>>> > >>>>>>>
>>>>>>> > >>>>>>> Regards,
>>>>>>> > >>>>>>> Suraj
>>>>>>> > >>>>>>>
>>>>>>> > >>>>>>>
>>>>>>> > >>>>>>> On Sun, May 5, 2013 at 4:09 PM, Edward J. Yoon <[email protected]> wrote:
>>>>>>> > >>>>>>>
>>>>>>> > >>>>>>>> I think 'record converter' should be removed. It's not a good
>>>>>>> > >>>>>>>> idea. Moreover, it's unnecessarily complex. To keep the vertex
>>>>>>> > >>>>>>>> input reader, we can move the related classes into the common
>>>>>>> > >>>>>>>> module.
>>>>>>> > >>>>>>>>
>>>>>>> > >>>>>>>> Let's go with my original plan.
>>>>>>> > >>>>>>>>
>>>>>>> > >>>>>>>> On Sat, May 4, 2013 at 9:32 AM, Edward J. Yoon <[email protected]> wrote:
>>>>>>> > >>>>>>>>> Hi all,
>>>>>>> > >>>>>>>>>
>>>>>>> > >>>>>>>>> I'm reading our old discussions about record converter,
>>>>>>> > >>>>>>>>> superstep injection, and common module:
>>>>>>> > >>>>>>>>>
>>>>>>> > >>>>>>>>> - http://markmail.org/message/ol32pp2ixfazcxfc
>>>>>>> > >>>>>>>>> - http://markmail.org/message/xwtmfdrag34g5xc4
>>>>>>> > >>>>>>>>>
>>>>>>> > >>>>>>>>> To clarify goals and objectives:
>>>>>>> > >>>>>>>>>
>>>>>>> > >>>>>>>>> 1. Parallel input partitioning is necessary for obtaining
>>>>>>> > >>>>>>>>> scalability and elasticity in Bulk Synchronous Parallel
>>>>>>> > >>>>>>>>> processing. (It's not a memory issue, or Disk/Spilling Queue,
>>>>>>> > >>>>>>>>> or HAMA-644. Please don't get sidetracked.)
>>>>>>> > >>>>>>>>> 2. Input partitioning should be handled at the BSP framework
>>>>>>> > >>>>>>>>> level, and it is for all Hama jobs, not only for Graph jobs.
>>>>>>> > >>>>>>>>> 3. Unnecessary I/O overhead needs to be avoided, and NoSQL
>>>>>>> > >>>>>>>>> input should also be considered.
>>>>>>> > >>>>>>>>>
>>>>>>> > >>>>>>>>> The current problem is that every input of graph jobs must be
>>>>>>> > >>>>>>>>> rewritten on HDFS. If you have a good idea, please let me know.
>>>>>>> > >>>>>>>>>
>>>>>>> > >>>>>>>>> --
>>>>>>> > >>>>>>>>> Best Regards, Edward J. Yoon
>>>>>>> > >>>>>>>>> @eddieyoon



-- 
Best Regards, Edward J. Yoon
@eddieyoon
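
As a rough illustration of the second option discussed above (the
partitioning step only routes raw records by partition ID, and loading
always parses them), here is a minimal Python sketch. It is not Hama
code: the function names partition_raw and load_vertices, the
tab-separated record layout, and the CRC32 hash are all assumptions made
for this example.

```python
from collections import defaultdict
from zlib import crc32


def partition_id(raw_record: str, num_tasks: int) -> int:
    """Pick a partition from the vertex ID only (first tab-separated field)."""
    vertex_id = raw_record.split("\t", 1)[0]
    # crc32 is stable across runs, unlike Python's built-in hash() for str.
    return crc32(vertex_id.encode("utf-8")) % num_tasks


def partition_raw(records, num_tasks):
    """Partitioning step (hypothetical): route raw, unparsed records."""
    partitions = defaultdict(list)
    for record in records:
        partitions[partition_id(record, num_tasks)].append(record)
    return partitions


def load_vertices(raw_records):
    """Load step (hypothetical): always parse raw records into vertices."""
    vertices = {}
    for record in raw_records:
        vertex_id, *neighbours = record.split("\t")
        vertices[vertex_id] = neighbours
    return vertices


if __name__ == "__main__":
    text_input = ["a\tb\tc", "b\ta", "c\ta"]
    parts = partition_raw(text_input, num_tasks=2)
    for pid, raw in sorted(parts.items()):
        print(pid, load_vertices(raw))
    # Input that is already split per task (for example, table regions)
    # skips partition_raw and goes straight to load_vertices.
    print(load_vertices(["x\ty"]))
```

Because parsing happens only in load_vertices, the same load path would
serve both partitioned file input and table input that skips the
partitioning step.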
