On 11/21/2017 03:26 PM, Jason wrote:
On Monday, November 20, 2017 at 10:49:01 AM UTC-5, Jason wrote:
A pipeline can be described as a sequence of functions applied to an input, with 
each subsequent function receiving the output of the preceding one:

out = f6(f5(f4(f3(f2(f1(in))))))

However, this isn't very readable and does not support conditionals.
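The nested form can at least be flattened with a small compose helper; here is a minimal sketch (the step functions are toy stand-ins for f1..f6):

```python
import functools

def compose(*funcs):
    """Compose functions left to right: compose(f1, f2)(x) == f2(f1(x))."""
    return functools.reduce(lambda f, g: lambda x: g(f(x)), funcs)

# Toy six-step pipeline, read top to bottom instead of inside out
pipeline = compose(
    lambda x: x + 1,   # f1
    lambda x: x * 2,   # f2
    lambda x: x - 3,   # f3
    str,               # f4
    str.upper,         # f5 (no-op on digits, shown only for shape)
    len,               # f6
)
print(pipeline(10))  # ((10 + 1) * 2 - 3) -> 19 -> "19" -> length 2
```

This fixes readability but still offers no conditionals, which is the harder part.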

TensorFlow has tensor-focused pipelines:
     fc1 = layers.fully_connected(x, 256, activation_fn=tf.nn.relu, scope='fc1')
     fc2 = layers.fully_connected(fc1, 256, activation_fn=tf.nn.relu, scope='fc2')
     out = layers.fully_connected(fc2, 10, activation_fn=None, scope='out')

I have some code which allows me to mimic this, but with an implied parameter.

import functools

def executePipeline(steps, collection_funcs=(map, filter, functools.reduce)):
    results = None
    for step in steps:
        func, params = step[0], step[1]
        if func in collection_funcs:
            # Collection functions get the step's own function partially
            # applied, with the running results as the collection argument.
            print(func, params[0])
            results = func(functools.partial(params[0], *params[1:]), results)
        else:
            print(func)
            if results is None:
                results = func(*params)
            else:
                results = func(*(params + (results,)))
    return results

executePipeline([
    (read_rows, (in_file,)),
    (map, (lower_row, field)),
    (stash_rows, ('stashed_file',)),
    (map, (lemmatize_row, field)),
    (vectorize_rows, (field, min_count)),
    (evaluate_rows, (weights, None)),
    (recombine_rows, ('stashed_file',)),
    (write_rows, (out_file,)),
])

This gets me close, but I can't control where the running results get passed in. 
In the code above, they are always the last parameter.
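One common way around the last-parameter limitation is a sentinel placeholder marking where the previous step's output should land. A hypothetical sketch (RESULT and the toy steps are my own names, not from the original code):

```python
RESULT = object()  # sentinel marking where the previous step's output goes

def execute_pipeline(steps):
    results = None
    for func, params in steps:
        if RESULT in params:
            # Substitute the running results at the marked position(s).
            args = tuple(results if p is RESULT else p for p in params)
        else:
            # Fall back to the original behavior: results go last.
            args = params if results is None else params + (results,)
        results = func(*args)
    return results

# Toy demonstration with stand-in steps
out = execute_pipeline([
    (range, (5,)),                      # -> range(0, 5)
    (map, (lambda n: n * n, RESULT)),   # result placed explicitly
    (sorted, (RESULT,)),
    (sum, (RESULT,)),
])
print(out)  # 0 + 1 + 4 + 9 + 16 = 30
```

Because the sentinel compares by identity, it can never collide with real parameter values.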

I feel like I'm reinventing a wheel here. I was wondering if something already 
exists?

Why do I want this? Because I'm tired of writing code that is locked away in a 
bespoke function; otherwise I end up with an army of functions, all slightly 
different in functionality. I require flexibility in defining pipelines, and I 
don't want a custom pipeline to require any low-level coding. I just want to 
feed a sequence of functions to a script and have it process them: a middle 
ground between the shell | operator and bespoke Python code. Sure, I could 
write many binaries bound by shell, but some things are done far more easily in 
Python because of its extensive libraries, and state can persist throughout the 
execution of the pipeline, whereas in the shell any temporary persistence has 
to go through environment variables or files.

Well, after examining your feedback, it looks like Grapevine has 99% of the 
concepts I wanted to invent, even if the | operator seems a bit clunky. I 
personally prefer the fluent interface convention. But this should work.
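For comparison, a fluent-style pipeline takes only a few lines; this is a hypothetical illustration of the convention, not Grapevine's actual API:

```python
class Pipeline:
    """Minimal fluent pipeline: each .then() queues a step, .run() executes."""

    def __init__(self):
        self.steps = []

    def then(self, func, *args):
        # Queue func; the running value is appended as the last argument.
        self.steps.append((func, args))
        return self  # returning self is what enables method chaining

    def run(self, value):
        for func, args in self.steps:
            value = func(*args, value)
        return value

result = (Pipeline()
          .then(map, str.strip)
          .then(filter, None)   # filter(None, ...) drops falsy items
          .then(list)
          .run(["  a ", "", "b  "]))
print(result)  # ['a', 'b']
```

The chained form reads top to bottom like the | style, but stays ordinary Python with no operator overloading.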

Kamaelia could also work, but it seems a little bit more grandiose.


Thanks everyone who chimed in!

This looks very much like what I have been working on of late: a generic processing paradigm based on chainable building blocks. I call them Workshops, because the base class can be thought of as a workshop that takes some raw material, processes it and delivers the product (to the next in line). Your example might look something like this:

    >>> import workshops as WS

    >>> Vectorizer = WS.Chain (
            WS.File_Reader (),        # WS provides
            WS.Map (lower_row),       # WS provides (wrapped builtin)
            Row_Stasher (),           # You provide
            WS.Map (lemmatize_row),   # WS provides. Name for addressed Directions sending.
            Row_Vectorizer (),        # Yours
            Row_Evaluator (),         # Yours
            Row_Recombiner (),
            WS.File_Writer (),
            _name = 'Vectorizer'
        )

    Parameters are process-control settings that travel through a subscription-based mailing system separate from the payload pipe.

    >>> Vectorizer.post (min_count = ...)    # Set all parameters that control the entire run.
    >>> Vectorizer.post ("File_Writer", file_name = 'output_file_name')    # Addressed, not meant for File_Reader

    Run:

    >>> Vectorizer ('input_file_name')    # File Writer returns 0 if the Chain completes successfully.
    0

    If you would provide a list of your functions (input, output, parameters), I'd be happy to show a functioning solution. Writing a Shop follows a simple standard pattern: naming the subscriptions, if any, and writing a single method that reads the subscribed parameters, if any, then takes the payload, processes it and returns the product.
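    The pattern described might be sketched roughly like this. This is a guess at the shape, not the actual Workshops code, and every name here (Shop, subscriptions, post, Repeater) is hypothetical:

```python
class Shop:
    """Hypothetical base class: subscribe to named parameters by name,
    then process the payload in a single method."""
    subscriptions = ()  # names of parameters this Shop cares about

    def __init__(self):
        self.params = {}

    def post(self, **params):
        # Keep only the parameters this Shop subscribed to; ignore the rest.
        self.params.update(
            {k: v for k, v in params.items() if k in self.subscriptions})

    def __call__(self, payload):
        raise NotImplementedError  # subclasses do the actual processing

class Repeater(Shop):
    subscriptions = ('count',)

    def __call__(self, payload):
        # Read subscribed parameter, process payload, return the product.
        return payload * self.params.get('count', 1)

shop = Repeater()
shop.post(count=3, unrelated='ignored')  # unsubscribed parameters are dropped
print(shop('ab'))  # 'ababab'
```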

    I intend to share the system, provided there's an interest. I'd have to tidy it up quite a bit, though, before daring to release it.

    There's a lot more to it . . .

Frederic

--
https://mail.python.org/mailman/listinfo/python-list
