This is not a question or reply. Nor is it short. If not interested, feel free
to delete.
It is an observation based on recent experiences.
We have had quite a few messages pointing out how people approach solving a
problem using subconscious paradigms inherited from their past. This includes
people who have never programmed and think of how they might do the task
manually, as well as people who are proficient in one or more other computer
languages, whose first attempt to visualize the solution may lead along paths
that are doable in Python but not optimal or even advisable.
I recently had to throw together a routine that would extract info from
multiple SAS data files and join them together on one key into a large
DataFrame (or data.frame, or whatever other name a tabular object goes by).
Then I needed to write the result out to disk as either a CSV or an XLSX file
for future use.
Since I have studied and used (not to mention abused) many programming
languages, my first thought was to do this in R. It has lots of the tools
needed for such things, including packages (sort of like modules you can
import, but not exactly), and I have done many data/graphics programs in it.
After some thought, I then redid it in Python.
The pseudocode outline is (a rough Python sketch follows the list):
* Read in all the files into a set of data.frame objects.
* Trim back the variables/columns of some of them as many are not needed.
* Join them together on a common index using a full or outer join.
* Save the result on disk as a Comma Separated Values file.
* Save the result on disk as a named tab in a new style EXCEL file.
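In pandas terms, a minimal sketch of that outline might look like this (the
file names, the key COMMON, and the kept columns are invented for
illustration; pandas.read_sas is one way to ingest SAS data sets):

    import pandas as pd
    from functools import reduce

    # Placeholder file names and the columns worth keeping from each.
    wanted = {
        'a.sas7bdat': ['COMMON', 'ALPHA'],
        'b.sas7bdat': ['COMMON', 'BETA'],
        'c.sas7bdat': ['COMMON', 'GAMMA'],
    }

    # Steps 1 and 2: read each file and trim back to the needed columns.
    dflist = [pd.read_sas(name)[cols] for name, cols in wanted.items()]

    # Step 3: fold a full (outer) join across the list on the common key.
    joined = reduce(lambda l, r: pd.merge(l, r, on='COMMON', how='outer'),
                    dflist)

    # Steps 4 and 5: save as CSV and as a named tab in an XLSX file.
    joined.to_csv('joined.csv', index=False)
    joined.to_excel('joined.xlsx', sheet_name='joined', index=False)

That is just the happy path; more on the join wrinkles and the XLSX step
below.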
I determined some of what I might use, such as the built-in commands, packages,
and functions needed for the early parts, but ran into an annoyance: some of
the files contained duplicate column names. Luckily, the R function reduce (not
the same as map/reduce) is like many things in R and takes a list of items,
folding the join across all of them. Also, by default, the join renames
duplicates, so if you have ALPHA in multiple places it names them ALPHA.x,
ALPHA.x.x, and other variations.
# Fold full_join over the list of data.frames, joining on the COMMON key.
df_joined <- reduce(df_list, full_join, by = "COMMON")
Mind you, when I stripped away many of the columns not needed in some of the
files, there were fewer duplicates and a much smaller output file.
But I ran into a wall later. Saving into a CSV is trivial. There are multiple
packages meant for saving into an XLSX file, but they all failed for me. One
wanted something in JAVA, another may want PERL, and some may want packages I
do not have installed. So, rather than bash my head against the wall, I
pivoted and used the best XLSX maker there is: I opened the CSV file in EXCEL
manually and did a SAVE AS …
Then I went to plan C (no, not the language C or its many extensions like C++),
as I am still learning Python and have not used it much. As an exercise, I
decided to learn how to do this in Python using tools/modules like numpy and
pandas that I have not had much need for, as well as additional tools for
reading and writing files in other formats.
My first attempts gradually worked, after lots of mistakes and looking at
manual pages. The result followed an eclectic set of paradigms but worked. Not
immediately, as I ran into a problem: the pandas version of a join did not
tolerate duplicate column names when used on a list of DataFrames. I could get
it to rename the left or right columns (by adding a suffix) only when used on
exactly two DataFrames. So, I needed to take the first df and do a
df.join(second, …), then take that result and join the third, and so on. I
also needed to keep telling it to set the index to the common key for each and
every df, including the newly joined result. And, due to size, I chose to keep
deleting DataFrames no longer in use, since names still bound to them would
otherwise keep them from being garbage collected.
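To make that wrinkle concrete, here is a small made-up pair of frames (ALPHA
and ID are placeholders): the two-frame form of DataFrame.join accepts an
rsuffix, while the list form does not, so overlapping column names make it
fail.

    import pandas as pd

    df1 = pd.DataFrame({'ALPHA': [1, 2]}, index=pd.Index([10, 20], name='ID'))
    df2 = pd.DataFrame({'ALPHA': [3, 4]}, index=pd.Index([10, 30], name='ID'))

    # Exactly two frames: rsuffix renames the right-hand duplicate column,
    # giving columns ALPHA and ALPHA_1.
    both = df1.join(df2, how='outer', rsuffix='_1')

    # A list of frames: lsuffix/rsuffix are not supported in this form, so
    # the duplicate ALPHA columns raise an error. Uncomment to see:
    # df1.join([df2], how='outer')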
I then looked again at how to tighten it up in a more pythonic way. In English
(my sixth language, since we are talking about languages 😉), I had done some
things linearly and then shifted to a list method: I used a list of file names
and a list of the df made from each file after removing unwanted columns.
(NOTE: I say “column”, but depending on language and context I mean variable,
field, axis, or any of the many other ways to describe a group of related
information in a tabular structure that crosses rows or instances.)
So I was able to do my multi-step join more like this:
current = dflist[0].set_index('ID')
for suffix, df in enumerate(dflist[1:], start=1):
    # Set each frame's index to the common key, then outer-join;
    # rsuffix renames any duplicate columns coming from the right side.
    current = current.join(df.set_index('ID'), how='outer',
                           rsuffix='_' + str(suffix))
In this formulation, the intermediate DataFrame objects previously bound to
current are silently garbage collected once nothing points to them. Did I
mention these were huge files?
The old code was much longer and more error prone, as I had df1, df2, … df8 as
well as other intermediates, and it was easy to copy and paste a line and then
fail to edit the changes properly.
On to the main point. Saving as a CSV was trivial. Saving as an XLSX took some
work BECAUSE I had pandas but was apparently missing some components it
needed. I had to figure out what was missing, install it, and finally got it
working nicely.
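For anyone who hits the same wall: pandas hands XLSX writing off to a separate
engine package, usually openpyxl or XlsxWriter, installed on its own; without
one, to_excel complains about a missing optional dependency. Assuming that was
my missing component, the save steps look like:

    # to_csv needs nothing beyond pandas itself.
    current.to_csv('joined.csv')

    # to_excel needs an engine package; pandas picks one up automatically
    # if installed, or it can be named explicitly.
    current.to_excel('joined.xlsx', sheet_name='joined', engine='openpyxl')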
In this case, I am sure I could have figured out how to make my R environment
work on my Windows 10 machine, installed the other software needed, moved my
development into Cygwin, or even used Linux.
Now for the main point. As hinted above, the functionality in a particular
programming language like R or Python sometimes already relies on external
resources, including parts written in other languages. Python makes extensive
use of internals written in C for speed, and the reference interpreter itself
is written in C. Similarly, R uses C and C++. So the next step can be to
integrate the languages. They have different strengths and weaknesses, and I
know both are used heavily in Machine Learning, which is part of my current
focus. Some things are best done when your language matches the
method/algorithm you use. R has half a dozen object-oriented variations, and
generally they are not as useful as the elegant and consistent OO model in
Python. But R has strengths that let me make weird graphics in at least three
major philosophies (base, lattice, and ggplot), and lots of nice features flow
from its underlying philosophy that everything is a vector and most operations
are, or can be, vectorized. Handy. It can lead to modes of thinking about a
problem quite different from the pythonic way.
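Python reaches much the same vectorized mode of thinking through numpy, for
example:

    import numpy as np

    v = np.array([1.0, 2.0, 3.0])
    # Arithmetic applies element-wise across the whole vector, no loop.
    print(v * 2 + 1)   # [3. 5. 7.]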
So, I can now load an R package that lets me shift back and forth in the same
session to a now embedded Python interpreter. Many “objects” can be created
and processed in one language, then passed through the veil to functionality
in the other, and back and forth. You can load modules within the Python
component and load packages in the R component. Within reason, you use the
best tools for the job. If part is best done with sets, dictionaries, or
generators, do it on the Python side. Want to use ggplot2 for graphics? Do it
on the R side.
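As a minimal sketch of the same bridging idea approached from the Python side,
here using the rpy2 module, which embeds R in a Python session (a different
tool than the R package described above):

    import rpy2.robjects as robjects

    # Evaluate R code from inside Python; the result comes back as an
    # R vector wrapper that can be indexed from Python.
    r_sum = robjects.r('sum(c(1, 2, 3))')
    print(r_sum[0])   # 6.0

    # Fetch an R function object and call it directly from Python.
    r_paste = robjects.r['paste']
    print(r_paste('hello', 'from', 'R')[0])   # "hello from R"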
In reality, there are many more programming languages (especially dormant
ones) than are really needed. But they evolve, and some are designed more for
certain tasks than others. So, if you want to use Python, why not see if you
can use it the way it is loosely intended? If all you are doing is rewriting
your code to fit the mold of another language, why not use that language
instead, if it is still available?
Do I have a favorite language? Not really. I note that in attempts to improve
Python (and other languages too) over the years, features keep being added,
often in ways that change things just enough that there is no longer one clear
philosophy. I can see heavy borrowing from many horse breeds producing a
camel. So there isn’t really ONE pythonic way for many things. We have two
completely separate ways to format strings that end up with fairly similar
functionality. Actually, there is an implicit third way 😊
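For instance, guessing at which ways are meant: printf-style %-formatting and
str.format, with f-strings as the implicit third:

    name, count = 'tutor', 2
    print('%s has %d ways' % (name, count))        # printf-style formatting
    print('{} has {} ways'.format(name, count))    # str.format
    print(f'{name} has {count} ways')              # f-string (Python 3.6+)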
So I think there are multiple overlapping sets of what it means to be
pythonic. If you come from a procedural background, you can do much without
creating objects or using functional programming skills. If you come from an
OO background, you can have fun making endless classes and subclasses to do
just about everything, including procedural things the hard way, by creating
one master object that controls lots of slave objects. If you like functional
programming with factories that churn out functions that may retain their
variables, again, have fun. There are other such paradigms supported too,
including lots of miniature sub-languages you can create, regular expressions
being one example, as well as the print formatting methods. To be fluent in
python, though, you need to be able to use other people’s existing code and
perhaps be able to modify or emulate it. That effectively means being open to
multiple ways, so in a sense, being pythonic includes being flexible, to a
point.
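A tiny example of that last idea, a factory whose product retains its
variables (a closure):

    def make_adder(n):
        # The returned function "remembers" n via a closure.
        def add(x):
            return x + n
        return add

    add5 = make_adder(5)
    print(add5(3))   # 8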
Good place to stop and resume my previously scheduled life.