11/05/17 - Tabular Data Structures for Data Analysis - Oleksandr Zaytsev

serge . stinckwich Tue, 16 May 2017 10:24:07 -0700

I was asking Philippe, but I hope to see you at ESUG too!

Sent from my iPhone


> On 16 May 2017 at 19:02, Oleksandr Zaytsev <[email protected]> wrote:
> 
> I would love to, but to go to Lille from my country I would need a visa, 
> which is not that easy to acquire.
> So maybe I will come to PharoDays 2018.
> And I will definitely try to come to ESUG Conference in September.
> 
> Oleks
> 
>> On Tue, May 16, 2017 at 7:26 PM, <[email protected]> wrote:
>> 
>> 
>> Sent from my iPhone
>> 
>>> On 11 May 2017 at 11:43, "[email protected]" <[email protected]> wrote:
>>> 
>>> ---------- Forwarded message ----------
>>> From: "[email protected]" <[email protected]>
>>> Date: 11 May 2017 10:54
>>> Subject: Re: 11/05/17 - Tabular Data Structures for Data Analysis - 
>>> Oleksandr Zaytsev
>>> To: "Nick Papoylias" <[email protected]>
>>> Cc: 
>>> 
>>> 
>>> 
>>>> On Thu, May 11, 2017 at 10:20 AM, Nick Papoylias <[email protected]> 
>>>> wrote:
>>>> 
>>>> 
>>>>> On Thu, May 11, 2017 at 5:24 AM, Oleksandr Zaytsev 
>>>>> <[email protected]> wrote:
>>>>> A. Work done
>>>>> - Downloaded the threaded VM, as suggested by Esteban Lorenzano, to make 
>>>>>   Iceberg work. And it does! I have successfully pushed my NeuralNetwork 
>>>>>   code to GitHub: https://github.com/olekscode/MLNeuralNetwork
>>>>> - Joined the PolyMath organization on GitHub
>>>>> - Created a repository for the TabularDataset project, 
>>>>>   https://github.com/PolyMathOrg/TabularDataset, as part of the PolyMath 
>>>>>   organization on GitHub
>>>>> - Fixed PolyMath issue #25 and made a PR
>>>>> - Read an article from the Wolfram Mathematica documentation about Dataset. 
>>>>>   It was one of the reading suggestions sent to me by Nick Papoylias
>>>>> B. Next steps
>>>>> - Fix more PolyMath issues using Iceberg. I have to get used to it by 
>>>>>   the time the coding phase starts
>>>>> - Read the rest of Nick Papoylias's suggestions
>>>>> C. Help needed
>>>>> The Dataset in Wolfram, as well as Pandas in Python, has a very advanced 
>>>>> indexing system. Smalltalk has its own conventions for indexing, so it 
>>>>> would be great to get familiar with them. Could you suggest some reading 
>>>>> on this topic (what are the indexing conventions in Smalltalk)?
>>>>> For example, in Wolfram I can write dataset[[-1]] to extract the last 
>>>>> row. But in Pharo indexes cannot be negative. In Pharo I would say 
>>>>> dataset last. But how about dataset[[-5]]?
>>>> This would be a good exercise for you ;) In Pharo you can easily add 
>>>> negative indexing yourself. 
>>>> 
>>>> Hint: You know the index of the last element, since this is the size of 
>>>> the collection, so... ;)
>>>> 
>>> No need for changes, this exists already.
>>> 
>>> Use atWrap: and atWrap:put: with negative indexes.
>>> 'hello' atWrap: -2
>>> 
>>> There is a specific version for Array using a primitive.
>>> #( 10 20 30 40 ) atWrap: -1
>>> 
>>> atWrap: 0 gives you the last item.
>>> atWrap: -1 gives 30.
>>> 
>>> This is different from languages with 0-based indexes such as Python, 
>>> where -1 is the last item.
>>> 
>>> The interesting thing about atWrap: is that it uses modulo internally, so 
>>> you do not need to care about that.
>>> 
>>> ($/ split: 'abc/def/ghi/jkl') atWrap: -1 
>>> --> 'ghi'
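
The wrap-around arithmetic described above can be sketched in a playground. This is a minimal sketch, assuming a Pharo image in which `\\` is floored modulo; `wrappedIndex` is a hypothetical helper block written only for illustration and is not part of Pharo.

```smalltalk
"wrappedIndex is a hypothetical helper that mirrors the modulo arithmetic
 behind atWrap: on 1-based collections; it is not part of Pharo."
| items wrappedIndex |
items := (1 to: 10) asArray.
wrappedIndex := [ :index :size | (index - 1) \\ size + 1 ].

wrappedIndex value: 0 value: items size.    "10, the last position"
wrappedIndex value: -1 value: items size.   "9, the second to last position"
wrappedIndex value: 11 value: items size.   "1, wraps past the end"

"The helper and atWrap: pick the same element."
(items at: (wrappedIndex value: -1 value: items size)) = (items atWrap: -1).   "true"

"Wolfram counts -1 as the last element, so dataset[[-5]] corresponds to
 shifting the index by one before wrapping."
items atWrap: -5 + 1.   "6, the fifth element from the end"
```

Because the index wraps, an out-of-range request such as `atWrap: -25` silently answers an element instead of signalling an error; a dataset class that wants Wolfram-like bounds checking would have to add it separately.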
>>> 
>>> The Matrix class has a bunch of things API wise but the class is highly 
>>> inefficient, doing copies all the time etc. It would be nice to have some 
>>> kind of futures/copy on write style things in there.
>>> 
>>> I miss cbind and rbind. These are useful. I have some half baked super 
>>> inefficient implementations of these things for Matrix.
>>> 
>>> https://stat.ethz.ch/R-manual/R-devel/library/base/html/cbind.html
>>> 
>>> The ability to name columns is also nice to have.
>>> 
>>> In R one does: 
>>> 
>>> df <- data.frame(c(1,2,3))
>>> df <- cbind(df, c(4,5,6))
>>> df <- cbind(df, c(7,8,9))
>>> names(df) <- c("C1", "C2", "C3")
>>> 
>>> The names can be read back with:
>>> 
>>> names(df)
>>> 
>>> A Smalltalkish style would be welcome.
>>> 
>> 
>> 
>> 
>> Interesting! Are you coming to PharoDays? We can talk about that if we 
>> find time.
>> 
>>> Maybe looking at the Voyage queries can be helpful. 
>>> 
>>> Phil
>>>  
>>>  
>>>> Try adding an extension method to OrderedCollection or SequenceableCollection.
>>>> 
>>>> If the Pharo by Example chapter or the MOOC is not enough, read the source
>>>> itself in the core to see how basic methods are implemented (it is less 
>>>> scary than it sounds).
>>>> 
>>>> You can also try Chapters 9, 10, 11 of the Blue Book (some API changes may 
>>>> apply):
>>>> 
>>>> http://sdmeta.gforge.inria.fr/FreeBooks/BlueBook/Bluebook.pdf
>>>> 
>>>>> Or what is the best way of implementing this index: dataset[["name"]] 
>>>>> (extracts a named row), dataset[[1]] (extracts the first row)? Should I 
>>>>> create two separate messages: dataset rowNamed: 'name' and dataset rowAt: 
>>>>> 1?
>>> rowNamed:
>>> rowAt: 
>>> 
>>> Yes, it looks like it.
>>> 
>>> But if we want to model things like R data frames, for example, this has to 
>>> be seen as a vectorized operation, so that you can use row slices, column 
>>> slices, and logical indexes.
>>> 
>>> Check this out: 
>>> 
>>> http://www.r-tutor.com/r-introduction/data-frame/data-frame-row-slice
>>> https://www.r-bloggers.com/working-with-data-frames/
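
To make the three kinds of indexing concrete, here is a rough playground sketch on a plain Array of row Arrays. It assumes only standard collection messages (`copyFrom:to:`, `collect:`, `with:do:`); the variable names and the data are made up for illustration and are not a proposed API for the project.

```smalltalk
"Sketch only: rows is a plain Array of Arrays, not a real data frame class."
| rows rowSlice firstColumn mask selected |
rows := #( #(1 'a' 3.5) #(2 'b' 4.5) #(3 'c' 5.5) #(4 'd' 6.5) ).

"Row slice: rows 2 to 3."
rowSlice := rows copyFrom: 2 to: 3.

"Column slice: the first element of every row."
firstColumn := rows collect: [ :row | row at: 1 ].

"Logical index: one boolean per row, then keep the rows flagged true."
mask := firstColumn collect: [ :value | value > 2 ].
selected := OrderedCollection new.
rows with: mask do: [ :row :keep | keep ifTrue: [ selected add: row ] ].

rowSlice size.      "2"
firstColumn.        "#(1 2 3 4)"
selected size.      "2, the rows starting with 3 and 4"
```

A vectorized data frame would hide these loops behind single messages, but the underlying operations are the same.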
>>> 
>>>  
>>>> The internal representation of your data-structure can be anything at the 
>>>> moment, as long as you encapsulate it.
>>>> 
>>>> (i.e. it can be nested OrderedCollections with metadata mapping column 
>>>> names to indexes, or a dictionary of collections, etc.). 
>>>> 
>>>> If you don't expose it to the user (i.e. return it from the public API, or 
>>>> expect knowledge of it in argument passing), 
>>>> we can easily change it later. So first make it work, and we optimize 
>>>> later ;)
>>>> 
>>>> For your case it will be a little bit trickier because you also have the 
>>>> notions of a) rows and b) columns, which
>>>> are exposed to the user. So you would need to create abstractions for 
>>>> these too.
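
One way to read this advice, as a rough sketch: keep the storage private and answer copies or derived collections from the public messages. `TabularSketch`, its instance variables, and all selectors below are hypothetical names chosen for illustration; they are not the TabularDataset project's API. Methods are written in the usual `Class >> selector` documentation style, followed by a playground usage snippet.

```smalltalk
"Hypothetical class: storage is an OrderedCollection of row Arrays plus
 an Array of column names, both copied in and never answered directly."
Object subclass: #TabularSketch
    instanceVariableNames: 'columnNames rows'
    classVariableNames: ''
    package: 'TabularSketch'

TabularSketch >> columnNames: names rows: aCollectionOfRows
    "Copy the outer collections so the sketch owns its storage."
    columnNames := Array withAll: names.
    rows := OrderedCollection withAll: aCollectionOfRows

TabularSketch >> rowAt: index
    "Answer a copy so the caller cannot mutate the internal row."
    ^ (rows at: index) copy

TabularSketch >> columnNamed: aName
    "Answer a new collection built from the stored rows."
    | position |
    position := columnNames indexOf: aName.
    ^ rows collect: [ :row | row at: position ]

"Usage in a playground:"
| table |
table := TabularSketch new columnNames: #('x' 'y') rows: #( #(1 2) #(3 4) ).
table rowAt: 2.           "#(3 4)"
table columnNamed: 'y'.   "an OrderedCollection holding 2 and 4"
```

Since callers only see `rowAt:` and `columnNamed:`, the storage could later become a dictionary of columns without changing any client code.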
>>>> 
>>>> Cheers,
>>>> 
>>>> Nick
>>>>> 
>>>>> 
>>>>> If someone else is having problems with Iceberg on Linux, try downloading 
>>>>> the threaded VM:
>>>>> wget -O- get.pharo.org/vmT60 | bash
>>>>> And use an SSH (not HTTPS) remote URL.
>>>>> -- 
>>>>> You received this message because you are subscribed to the Google Groups 
>>>>> "Pharo Google Summer of Code" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send an 
>>>>> email to [email protected].
>>>>> To post to this group, send email to [email protected].
>>>>> To view this discussion on the web visit 
>>>>> https://groups.google.com/d/msgid/pharo-gsoc/CAEp0Uzu-8fw3dA6ezVoj-QptvLcB8cWPHvZ1tfLg1Ce8qkTqfQ%40mail.gmail.com.
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>> 
>>> 
>
