Re: getting/manipulating all fields in a pipe in scalding

'Alex Levenson' via Scalding Development Fri, 30 Jun 2017 16:14:46 -0700

Yeah, I guess what I was getting at is that if you use the Typed API, you
try to use types instead of names for these sorts of things, and a lot of
your code for casting one type to another goes away. But it can be painful
to rewrite your entire world this way. Unfortunately, I'm not familiar
enough with the untyped API to tell you how to do this, but at least here
at Twitter we would try to push our users towards using strong types, and
maybe the implicit typeclass pattern for extractors / converters / etc.
For example, you can see how TypedPipe.sumByKey takes an implicit, strongly
typed Semigroup which explains how to "sum" two values. Similarly, you can
create an implicit Sparser type class whose instance is picked at compile
time based on the types of the data.
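
Here is a minimal sketch of that typeclass idea in plain Scala, with no
Scalding dependency. The Sparser trait, its two instances, the render helper,
and the "index:value" encoding are all made up for illustration (the actual
sparse format was left unimplemented in your code), so treat it as an
assumption-laden example rather than anything Scalding ships:

```scala
// Hypothetical type class: describes how to turn a value of type T into a String.
trait Sparser[T] {
  def sparse(value: T): String
}

object Sparser {
  // Plain strings pass through unchanged.
  implicit val stringSparser: Sparser[String] = new Sparser[String] {
    def sparse(value: String): String = value
  }

  // Lists of doubles keep only their non-zero entries, written as "index:value".
  // This encoding is an assumption made for the example.
  implicit val doubleListSparser: Sparser[List[Double]] = new Sparser[List[Double]] {
    def sparse(value: List[Double]): String =
      value.zipWithIndex
        .collect { case (v, i) if v != 0.0 => s"$i:$v" }
        .mkString(",")
  }
}

object SparserExample {
  // The compiler selects the Sparser instance from the static type T,
  // so there is no casting and no lookup by field name.
  def render[T](value: T)(implicit sparser: Sparser[T]): String =
    sparser.sparse(value)
}
```

With a TypedPipe[List[Double]] you could then write something like
pipe.map(SparserExample.render(_)), and the instance is chosen from the
element type, much as sumByKey picks up its Semigroup.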


On Fri, Jun 30, 2017 at 4:07 PM, <[email protected]> wrote:

> Thanks for the reply Alex!
>
> I'm trying to implement a couple of scenarios. The first scenario is
> pretty much what I explained in the post (i.e. appending a fixed
> prefix/suffix to every field name in a pipe). The second scenario is that
> I want to iterate through all fields in a pipe and call a function on them
> based on their names. For example, let's say I have a bunch of different
> fields in a pipe, and if a field name contains the string "_list_" I want
> to convert its List[Any] value to a sparse representation of the list in
> String format. I guess if I write a Cascading Function in Java and invoke
> an "each" method on my pipe, that should do the trick, but I was wondering
> if there is a cleaner/easier way of doing this in Scalding:
>
> import java.util.Iterator;
> import cascading.operation.*;
> import cascading.tuple.*;
> import cascading.flow.*;
>
> public class Sparser extends BaseOperation<Tuple> implements Function<Tuple>
> {
> public Sparser()
>   {
>   // declare the results under the same names as the argument fields
>   super( Fields.ARGS );
>   }
>
> public Sparser( Fields fieldDeclaration )
>   {
>   super( fieldDeclaration );
>   }
>
> public void operate( FlowProcess flowProcess, FunctionCall<Tuple> functionCall )
>   {
>   // get the argument field names and the argument values
>   Fields fieldNames = functionCall.getArgumentFields();
>   TupleEntry arguments = functionCall.getArguments();
>
>   // create a Tuple to hold our result values
>   Tuple result = new Tuple();
>
>   Iterator<Object> iterator = arguments.getTuple().iterator();
>   int i = 0;
>   while( iterator.hasNext() )
>   {
>       Object obj = iterator.next();
>       if( fieldNames.get( i ).toString().contains( "_list_" ) )
>       {
>           // list-valued field: replace the value with its string form
>           @SuppressWarnings("unchecked")
>           java.util.List<Double> tmp = (java.util.List<Double>) obj;
>           String sparseRepresentation = tmp.toString(); // TO BE IMPLEMENTED
>           result.add( sparseRepresentation );
>       }
>       else
>       {
>           // pass other fields through unchanged; casting every value
>           // to String would fail for non-String fields
>           result.add( obj );
>       }
>       i++;
>   }
>
>   // return the result Tuple
>   functionCall.getOutputCollector().add( result );
>   }
> }
>
> btw, I'm not sure if I understand what you mean by "an extractor method",
> can you please send me a pointer to an example?
>
> Any input is greatly appreciated!
>
> On Friday, June 30, 2017 at 3:51:09 PM UTC-7, Alex Levenson wrote:
>>
>> Probably not what you want to hear, but the Scalding dev team is really
>> only developing + supporting the Typed API at this point -- which would
>> make something like this even more difficult.
>> But the question I'd probably ask is: what are you trying to do, and can
>> you use strong types, the Typed API, and maybe an extractor method or
>> similar instead?
>>
>> On Fri, Jun 30, 2017 at 2:13 PM, <[email protected]> wrote:
>>
>>> Here is the question:
>>>
>>> Assume I have a pipe and I want to rename all the fields in the pipe
>>> programmatically, meaning that I don't want to hard code the field names in
>>> my code. Any idea how I can do this?
>>>
>>> As a concrete example, assume I have a pipe with two fields: "name" and
>>> "age" and I want to rename these fields to "employee_name" and
>>> "employee_age". Obviously the natural solution is to write a piece of code
>>> as below:
>>>
>>> pipe.rename(('name, 'age) -> ('employee_name, 'employee_age))
>>>
>>> or
>>>
>>> pipe.rename(new Fields("name", "age") ->  new Fields("employee_name",
>>> "employee_age"))
>>>
>>> However, what I need is to be able to iterate through all fields in the
>>> pipe without knowing their names.
>>>
>>> There are a couple of methods (resolveIncomingOperationArgumentFields
>>> and resolveIncomingOperationPassThroughFields) callable on a pipe which
>>> look promising, but the issue is that they both take an input argument of
>>> type cascading.flow.planner.Scope, and I don't know where to get one
>>> in a Scalding job.
>>>
>>> Another solution that comes to mind is using the "each" method on the
>>> pipe: implement a Cascading function and pass it to the each
>>> statement. But I was not able to find any sample code for that either.
>>>
>>> Thanks!
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "Scalding Development" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
>>
>>
>> --
>> Alex Levenson
>> @THISWILLWORK
>>



-- 
Alex Levenson
@THISWILLWORK
