This is not a question or reply. Nor is it short. If not interested, feel free
to delete.
It is an observation based on recent experiences.
We have had quite a few messages pointing out how people approach solving a
problem using subconscious paradigms inherited from their past. This includes
people who have never programmed and think of how they might do the task
manually, as well as people who are proficient in one or more other computer
languages, whose first attempt to visualize the solution may lead along paths
that are doable in Python but not optimal or even advisable.
I recently had to throw together a routine that would extract info from
multiple SAS data files and join them together on one key into a large
DataFrame (or data.frame, or whatever other name a tabular object goes by).
Then I needed to write the result out to disk as either a CSV or an XLSX file
for future use.
Since I have studied and used (not to mention abused) many programming
languages, my first thought was to do this in R. It has lots of the tools
needed for such things, including packages (sort of like modules you can
import, but not exactly), and I have done many data/graphics programs in it.
After some thought, I then redid it in Python.
The pseudocode outline is (a rough Python sketch follows the list):
* Read in all the files into a set of data.frame objects.
* Trim back the variables/columns of some of them as many are not needed.
* Join them together on a common index using a full or outer join.
* Save the result on disk as a Comma Separated Values file.
* Save the result on disk as a named tab in a new style EXCEL file.
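In pandas terms, a minimal sketch of that outline might look like this (the
file names, the key COMMON, and the kept columns are invented for
illustration; pandas.read_sas is one way to ingest SAS data sets):

    import pandas as pd
    from functools import reduce

    # Placeholder file names and the columns worth keeping from each.
    wanted = {
        'a.sas7bdat': ['COMMON', 'ALPHA'],
        'b.sas7bdat': ['COMMON', 'BETA'],
        'c.sas7bdat': ['COMMON', 'GAMMA'],
    }

    # Steps 1 and 2: read each file and trim back to the needed columns.
    dflist = [pd.read_sas(name)[cols] for name, cols in wanted.items()]

    # Step 3: fold a full (outer) join across the list on the common key.
    joined = reduce(lambda l, r: pd.merge(l, r, on='COMMON', how='outer'),
                    dflist)

    # Steps 4 and 5: save as CSV and as a named tab in an XLSX file.
    joined.to_csv('joined.csv', index=False)
    joined.to_excel('joined.xlsx', sheet_name='joined', index=False)

That is just the happy path; more on the join wrinkles and the XLSX step
below.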
I determined some of what I might use, such as the built-in commands, packages,
and functions needed for the early parts, but ran into an annoyance: some of
the files contained duplicate column names. Luckily, the R function reduce (not
the same as map/reduce) is like many things in R and takes a list of items,
folding the join across all of them. Also, by default, the join renames
duplicates, so if you have ALPHA in multiple places it names them ALPHA.x,
ALPHA.x.x, and other variations.
# Fold full_join over the list of data.frames, joining on the COMMON key.
df_joined <- reduce(df_list, full_join, by = "COMMON")
Mind you, when I stripped away many of the columns not needed in some of the
files, there were fewer duplicates and a much smaller output file.
But I ran into a wall later. Saving into a CSV is trivial. There are multiple
packages meant for saving into an XLSX file, but they all failed for me. One
wanted something in JAVA, another may want PERL, and some may want packages I
do not have installed. So, rather than bash my head against the wall, I
pivoted and used the best XLSX maker there is: I opened the CSV file in EXCEL
manually and did a SAVE AS …
Then I went to plan C (no, not the language C or its many extensions like C++),
as I am still learning Python and have not used it much. As an exercise, I
decided to learn how to do this in Python using tools/modules like numpy and
pandas that I have not had much need for, as well as additional tools for
reading and writing files in other formats.
My first attempts gradually worked, after lots of mistakes and looking at
manual pages. The result followed an eclectic set of paradigms but worked. Not
immediately, as I ran into a problem: the pandas version of a join did not
tolerate duplicate column names when used on a list of DataFrames. I could get
it to rename the left or right columns (by adding a suffix) only when used on
exactly two DataFrames. So, I needed to take the first df and do a
df.join(second, …), then take that result and join the third, and so on. I
also needed to keep telling it to set the index to the common key for each and
every df, including the newly joined result. And, due to size, I chose to keep
deleting DataFrames no longer in use, since names still bound to them would
otherwise keep them from being garbage collected.
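To make that wrinkle concrete, here is a small made-up pair of frames (ALPHA
and ID are placeholders): the two-frame form of DataFrame.join accepts an
rsuffix, while the list form does not, so overlapping column names make it
fail.

    import pandas as pd

    df1 = pd.DataFrame({'ALPHA': [1, 2]}, index=pd.Index([10, 20], name='ID'))
    df2 = pd.DataFrame({'ALPHA': [3, 4]}, index=pd.Index([10, 30], name='ID'))

    # Exactly two frames: rsuffix renames the right-hand duplicate column,
    # giving columns ALPHA and ALPHA_1.
    both = df1.join(df2, how='outer', rsuffix='_1')

    # A list of frames: lsuffix/rsuffix are not supported in this form, so
    # the duplicate ALPHA columns raise an error. Uncomment to see:
    # df1.join([df2], how='outer')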
I then looked again at how to tighten it up in a more pythonic way. In English
(my sixth language, since we are talking about languages 😉), I had done some
things linearly and then shifted to a list method: I used a list of file names
and a list of the df made from each file after removing unwanted columns.
(NOTE: I say “column”, but depending on language and context I mean variable,
field, axis, or any of the many other ways to describe a group of related
information in a tabular structure that crosses rows or instances.)
So I was able to do my multi-step join more like this:
current = dflist[0].set_index('ID')
for suffix, df in enumerate(dflist[1:], start=1):
    # Set each frame's index to the common key, then outer-join;
    # rsuffix renames any duplicate columns coming from the right side.
    current = current.join(df.set_index('ID'), how='outer',
                           rsuffix='_' + str(suffix))
In this formulation, the intermediate DataFrame objects previously bound to
current are silently garbage collected once nothing points to them. Did I
mention these were huge files?
The old code was much longer and more error prone, as I had df1, df2, … df8 as
well as other intermediates, and it was easy to copy and paste a line and then
fail to edit the changes properly.
On to the main point. Saving as a CSV was trivial. Saving as an XLSX took some
work BECAUSE I had pandas but was apparently missing some components it
needed. I had to figure out what was missing, install it, and finally got it
working nicely.
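For anyone who hits the same wall: pandas hands XLSX writing off to a separate
engine package, usually openpyxl or XlsxWriter, installed on its own; without
one, to_excel complains about a missing optional dependency. Assuming that was
my missing component, the save steps look like:

    # to_csv needs nothing beyond pandas itself.
    current.to_csv('joined.csv')

    # to_excel needs an engine package; pandas picks one up automatically
    # if installed, or it can be named explicitly.
    current.to_excel('joined.xlsx', sheet_name='joined', engine='openpyxl')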
In this case, I am sure I could have figured out how to make my R environment
work on my Windows 10 machine, installed the other software needed, moved my
development into Cygwin, or even used Linux.
Now for the main point. As hinted above, the functionality in a particular
programming language like R or Python sometimes already relies on external
resources, including parts written in other languages. Python makes extensive
use of internals written in C for speed, and the reference interpreter itself
is written in C. Similarly, R uses C and C++. So the next step can be to
integrate the languages. They have different strengths and weaknesses, and I
know both are used heavily in Machine Learning, which is part of my current
focus. Some things are best done when your language matches the
method/algorithm you use. R has half a dozen object-oriented variations, and
generally they are not as useful as the elegant and consistent OO model in
Python. But R has strengths that let me make weird graphics in at least three
major philosophies (base, lattice, and ggplot), and lots of nice features flow
from its underlying philosophy that everything is a vector and most operations
are, or can be, vectorized. Handy. It can lead to modes of thinking about a
problem quite different from the pythonic way.
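Python reaches much the same vectorized mode of thinking through numpy, for
example:

    import numpy as np

    v = np.array([1.0, 2.0, 3.0])
    # Arithmetic applies element-wise across the whole vector, no loop.
    print(v * 2 + 1)   # [3. 5. 7.]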
So, I can now load an R package that lets me shift back and forth in the same
session to a now embedded Python interpreter. Many “objects” can be created
and processed in one language, then passed through the veil to functionality
in the other, and back and forth. You can load modules within the Python
component and load packages in the R component. Within reason, you use the
best tools for the job. If part is best done with sets, dictionaries, or
generators, do it on the Python side. Want to use ggplot2 for graphics? Do it
on the R side.
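As a minimal sketch of the same bridging idea approached from the Python side,
here using the rpy2 module, which embeds R in a Python session (a different
tool than the R package described above):

    import rpy2.robjects as robjects

    # Evaluate R code from inside Python; the result comes back as an
    # R vector wrapper that can be indexed from Python.
    r_sum = robjects.r('sum(c(1, 2, 3))')
    print(r_sum[0])   # 6.0

    # Fetch an R function object and call it directly from Python.
    r_paste = robjects.r['paste']
    print(r_paste('hello', 'from', 'R')[0])   # "hello from R"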
In reality, there are many more programming languages (especially dormant
ones) than are really needed. But they evolve, and some are designed more for
certain tasks than others. So, if you want to use Python, why not see if you
can use it the way it is loosely intended? If all you are doing is rewriting
your code to fit the mold of another language, why not use that language
instead, if it is still available?
Do I have a favorite language? Not really. I note that in attempts to improve
Python (and other languages too) over the years, features keep being added,
often in ways that change things just enough that there is no longer one clear
philosophy. I can see heavy borrowing from many horse breeds producing a
camel. So there isn’t really ONE pythonic way for many things. We have two
completely separate ways to format strings that end up with fairly similar
functionality. Actually, there is an implicit third way 😊
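For instance, guessing at which ways are meant: printf-style %-formatting and
str.format, with f-strings as the implicit third:

    name, count = 'tutor', 2
    print('%s has %d ways' % (name, count))        # printf-style formatting
    print('{} has {} ways'.format(name, count))    # str.format
    print(f'{name} has {count} ways')              # f-string (Python 3.6+)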
So I think there are multiple overlapping sets of what it means to be
pythonic. If you come from a procedural background, you can do much without
creating objects or using functional programming skills. If you come from an
OO background, you can have fun making endless classes and subclasses to do
just about everything, including procedural things the hard way, by creating
one master object that controls lots of slave objects. If you like functional
programming with factories that churn out functions that may retain their
variables, again, have fun. There are other such paradigms supported too,
including lots of miniature sub-languages you can create, regular expressions
being one example, as well as the print formatting methods. To be fluent in
python, though, you need to be able to use other people’s existing code and
perhaps be able to modify or emulate it. That effectively means being open to
multiple ways, so in a sense, being pythonic includes being flexible, to a
point.
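A tiny example of that last idea, a factory whose product retains its
variables (a closure):

    def make_adder(n):
        # The returned function "remembers" n via a closure.
        def add(x):
            return x + n
        return add

    add5 = make_adder(5)
    print(add5(3))   # 8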
Good place to stop and resume my previously scheduled life.