Re: help: pandas and 2d table

jak via Python-list Tue, 16 Apr 2024 07:45:47 -0700

Stefan Ram ha scritto:

jak <[email protected]> wrote or quoted:

Stefan Ram ha scritto:

df = df.where( df == 'zz' ).stack().reset_index()
result ={ 'zz': list( zip( df.iloc[ :, 0 ], df.iloc[ :, 1 ]))}

Since I don't know Pandas, I will need a month at least to understand
these 2 lines of code. Thanks again.


   Here's a technique to better understand such code:

   Transform it into a program with small statements and small
   expressions with no more than one call per statement if possible.
   (After each litte change check that the output stays the same.)

import pandas as pd

# Warning! Will overwrite the file 'file_20240412201813_tmp_DML.csv'!
with open( 'file_20240412201813_tmp_DML.csv', 'w' )as out:
     print( '''obj,foo1,foo2,foo3,foo4,foo5,foo6
foo1,aa,ab,zz,ad,ae,af
foo2,ba,bb,bc,bd,zz,bf
foo3,ca,zz,cc,cd,ce,zz
foo4,da,db,dc,dd,de,df
foo5,ea,eb,ec,zz,ee,ef
foo6,fa,fb,fc,fd,fe,ff''', file=out )
# Note the "index_col=0" below, which is important here!
df = pd.read_csv( 'file_20240412201813_tmp_DML.csv', index_col=0 )

selection = df.where( df == 'zz' )
selection_stack = selection.stack()
df = selection_stack.reset_index()
df0 = df.iloc[ :, 0 ]
df1 = df.iloc[ :, 1 ]
z = zip( df0, df1 )
l = list( z )
result ={ 'zz': l }
print( result )

   I suggest to next insert print statements to print each intermediate
   value:

# Note the "index_col=0" below, which is important here!
df = pd.read_csv( 'file_20240412201813_tmp_DML.csv', index_col=0 )
print( 'df = \n', type( df ), ':\n"', df, '"\n' )

selection = df.where( df == 'zz' )
print( "result of where( df == 'zz' ) = \n", type( selection ), ':\n"',
   selection, '"\n' )

selection_stack = selection.stack()
print( 'result of stack() = \n', type( selection_stack ), ':\n"',
   selection_stack, '"\n' )

df = selection_stack.reset_index()
print( 'result of reset_index() = \n', type( df ), ':\n"', df, '"\n' )

df0 = df.iloc[ :, 0 ]
print( 'value of .iloc[ :, 0 ]= \n', type( df0 ), ':\n"', df0, '"\n' )

df1 = df.iloc[ :, 1 ]
print( 'value of .iloc[ :, 1 ] = \n', type( df1 ), ':\n"', df1, '"\n' )

z = zip( df0, df1 )
print( 'result of zip( df0, df1 )= \n', type( z ), ':\n"', z, '"\n' )

l = list( z )
print( 'result of list( z )= \n', type( l ), ':\n"', l, '"\n' )

result ={ 'zz': l }
print( "value of { 'zz': l }= \n", type( result ), ':\n"',
   result, '"\n' )

print( result )

   Now you can see what each single step does!

df =
  <class 'pandas.core.frame.DataFrame'> :
"      foo1 foo2 foo3 foo4 foo5 foo6
obj
foo1   aa   ab   zz   ad   ae   af
foo2   ba   bb   bc   bd   zz   bf
foo3   ca   zz   cc   cd   ce   zz
foo4   da   db   dc   dd   de   df
foo5   ea   eb   ec   zz   ee   ef
foo6   fa   fb   fc   fd   fe   ff "

result of where( df == 'zz' ) =
  <class 'pandas.core.frame.DataFrame'> :
"      foo1 foo2 foo3 foo4 foo5 foo6
obj
foo1  NaN  NaN   zz  NaN  NaN  NaN
foo2  NaN  NaN  NaN  NaN   zz  NaN
foo3  NaN   zz  NaN  NaN  NaN   zz
foo4  NaN  NaN  NaN  NaN  NaN  NaN
foo5  NaN  NaN  NaN   zz  NaN  NaN
foo6  NaN  NaN  NaN  NaN  NaN  NaN "

result of stack() =
  <class 'pandas.core.series.Series'> :
" obj
foo1  foo3    zz
foo2  foo5    zz
foo3  foo2    zz
       foo6    zz
foo5  foo4    zz
dtype: object "

result of reset_index() =
  <class 'pandas.core.frame.DataFrame'> :
"     obj level_1   0
0  foo1    foo3  zz
1  foo2    foo5  zz
2  foo3    foo2  zz
3  foo3    foo6  zz
4  foo5    foo4  zz "

value of .iloc[ :, 0 ]=
  <class 'pandas.core.series.Series'> :
" 0    foo1
1    foo2
2    foo3
3    foo3
4    foo5
Name: obj, dtype: object "

value of .iloc[ :, 1 ] =
  <class 'pandas.core.series.Series'> :
" 0    foo3
1    foo5
2    foo2
3    foo6
4    foo4
Name: level_1, dtype: object "

result of zip( df0, df1 )=
  <class 'zip'> :
" <zip object at 0x000000000B3B9548>"

result of list( z )=
  <class 'list'> :
" [('foo1', 'foo3'), ('foo2', 'foo5'), ('foo3', 'foo2'), ('foo3', 'foo6'), ('foo5', 
'foo4')]"

value of { 'zz': l }=
  <class 'dict'> :
" {'zz': [('foo1', 'foo3'), ('foo2', 'foo5'), ('foo3', 'foo2'), ('foo3', 'foo6'), 
('foo5', 'foo4')]}"

{'zz': [('foo1', 'foo3'), ('foo2', 'foo5'), ('foo3', 'foo2'), ('foo3', 'foo6'), 
('foo5', 'foo4')]}

   The script reads a CSV file and stores the data in a Pandas
   DataFrame object named "df". The "index_col=0" parameter tells
   Pandas to use the first column as the index for the DataFrame,
   which is kinda like column headers.

   The "where" creates a new DataFrame selection that contains
   the same data as df, but with all values replaced by NaN (Not
   a Number) except for the values that are equal to 'zz'.

   "stack" returns a Series with a multi-level index created
   by pivoting the columns. Here it gives a Series with the
   row-col-addresses of a all the non-NaN values. The general
   meaning of "stack" might be the most complex operation of
   this script. It's explained in the pandas manual (see there).

   "reset_index" then just transforms this Series back into a
   DataFrame, and ".iloc[ :, 0 ]" and ".iloc[ :, 1 ]" are the
   first and second column, respectively, of that DataFrame. These
   then are zipped to get the desired form as a list of pairs.


And this is a technique very similar to reverse engineering. Thanks for
the explanation and examples. All this is really clear and I was able to
follow it easily because I have already written a version of this code
in C without any kind of external library that uses the .CSV version of
the table as data ( 234 code lines :^/ ).


--
https://mail.python.org/mailman/listinfo/python-list

Re: help: pandas and 2d table

Reply via email to