I don't have enough experience in this field to comment very sensibly
(or offer to mentor), but I wanted to say that I think it would be
really nice to have a broader selection of specialised persistent data
structures available for Clojure.

On 12 March 2014 09:56, Matteo Ceccarello <matteo.ceccare...@gmail.com> wrote:
> The student application opened, so I'm going to submit my proposal.
> Does anyone have suggestions or requests?
>
> Cheers,
> Matteo
>
> Il giorno lunedì 3 marzo 2014 15:36:43 UTC+1, Matteo Ceccarello ha scritto:
>>
>> Hello everybody,
>>
>> I'm Matteo Ceccarello, a PhD student in Computer Engineering from
>> university of
>> Padova, Italy. Recently, I've started working with Clojure, a language
>> that I
>> find both powerful and fun. I've participated to GSoC 2013 and 2012 with
>> JPF. I'm interested in
>> participating to GSoC 2014 with Clojure, in order to gain confidence with
>> clojure under the guidance of a mentor while contributing to the
>> community. I'll
>> describe what I'd like to do below.
>>
>> Motivation
>>
>> Currently, alongside my research on parallel graph algorithms, I've
>> started
>> writing a web crawler in Clojure. The reasons I want to do this in clojure
>> are
>> many:
>>
>> The web crawling software we are using
>> (Heritrix) performs blocking IO. I think that in
>> big crawls this is a showstopper, since to achieve a good throughput one
>> must
>> use many Java threads (in the order of 500) that end up spending most of
>> their
>> time performing blocking waits.
>> Clojure has the wonderful core.async library, with all the magic that go
>> blocks and channels bring.
>> The http-kit library makes possible to perform async requests to web
>> servers in a very simple way.
>> Clojure has an awesome support for concurrency.
>>
>> What I want to achieve eventually is a concurrent asynchronous web
>> crawler,
>> without a single blocking call.
>>
>> Since I'm new to the language, before starting to get some real work done,
>> I've
>> worked on a few small projects, to get a feel of the language. What I
>> learned is
>> that everything you put in a ref or atom must be immutable. A fundamental
>> component of a web crawler is a data structure that tells you if you have
>> already visited a URL. Bloom filters are a common choice for the
>> implementation
>> of this component. Since this bloom filter will be put in a ref, I want it
>> to
>> be immutable and, for efficiency reasons, persistent. After some research
>> on the
>> web, I found out that an immutable persistent implementation of the bloom
>> filter
>> is still missing from the Clojure ecosystem.
>>
>> Since it seems that Clojure is missing a persistent implementation of many
>> probabilistic data structures, I came up with the following idea.
>>
>> Persistent probabilistic data structures for Clojure
>>
>> Clojure seems to miss persistent implementations of many useful
>> probabilistic
>> data structures, in particular:
>>
>> Bloom filters
>> Counting Bloom filters
>> Compact approximators
>> HyperLogLog counters
>> Count-min sketches
>>
>> Of course one can use one of the various implementations of these data
>> structures for
>> Java, however, being these implementations mutable, they cannot be used
>> in idiomatic concurrent clojure code (as for my understanding of idiomatic
>> concurrent clojure code).
>>
>> What I propose to realize is an optimized persistent implementation of
>> these
>> libraries. Hence I plan to explore different paths using
>> benchmarks as a guide,
>> for instance to decide whether it's more convenient to use standard
>> clojure vectors
>> to represent bit vectors or to provide a custom persistent
>> bit vector implementation.
>>
>> Since I like the feeling of having a static type checker that
>> prevents common errors, the library will be annotated with
>> Typed Clojure. Moreover I find interesting the idea
>> of using tests as machine checkable documentation, and
>> midje-doc seems the right tool for the job.
>>
>> Is there someone interested in mentoring me with this project?
>>
>> Yours sincerely
>> Matteo
>
> --
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to clojure@googlegroups.com
> Note that posts from new members are moderated - please be patient with your
> first post.
> To unsubscribe from this group, send email to
> clojure+unsubscr...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en
> ---
> You received this message because you are subscribed to the Google Groups
> "Clojure" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to clojure+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
--- 
You received this message because you are subscribed to the Google Groups 
"Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Reply via email to