Hi Simon,

Aha! I re-read your message and noticed this line:

lapply(J("A")$direct(), .jevalArray)

which I had overlooked earlier. I wrote an example very similar to
yours and now see what you mean about how we can do this directly.

Many thanks,

T

groovyScript <- paste(
    "def stringList = [] as java.util.List",
    "def numberList = [] as java.util.List",
    "for (def ctr in 0..99) { stringList << new String(\"TGIF $ctr\"); numberList << ctr }",
    "def strings = stringList.toArray()",
    "def numbers = numberList.toArray()",
    "def result = [strings, numbers]",
    "return (Object[]) result",
    sep = "\n")

result <- Evaluate(groovyScript = groovyScript)

temp <- lapply(result, .jevalArray)
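
For anyone following along, here is a sketch of going from that list of
two Java arrays to a data frame. The column names are my own invention,
and the exact coercion depends on what Groovy's toArray() hands back:
since toArray() returns Object[], .jevalArray() may leave the elements
as jobjRef references that need unwrapping, so this is just one way it
might look:

```r
library(rJava)

# `result` is the list of two Java array references returned by Evaluate()
temp <- lapply(result, .jevalArray, simplify = TRUE)

# If simplify could not reduce an Object[] to an atomic vector, the
# column comes back as a list of jobjRef objects; unwrap those scalars
cols <- lapply(temp, function(x)
    if (is.list(x)) sapply(x, .jsimplify) else x)

tempDF <- data.frame(strings = cols[[1]],
                     numbers = cols[[2]],
                     stringsAsFactors = FALSE)
```

This needs a running JVM with the Groovy evaluation set up as above, so
treat it as a sketch rather than something copy-paste runnable.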

On Fri, Jan 15, 2016 at 1:58 PM, Simon Urbanek
<simon.urba...@r-project.org> wrote:
>
>> On Jan 15, 2016, at 12:35 PM, Thomas Fuller 
>> <thomas.ful...@coherentlogic.com> wrote:
>>
>> Hi Simon,
>>
>> Thanks for your feedback -- this is an observation I wasn't
>> considering when I wrote this, mainly because I am, in fact, working
>> with rather small data sets. BTW: there is code there, under the
>> Bitbucket link -- here's the direct link if you'd still like to look
>> at it:
>>
>> https://bitbucket.org/CoherentLogic/jdataframe
>>
>
> Ah, sorry, all the links just send you back to the page, so I missed the little 
> field that tells you how to check it out.
>
>
>> Re "for practical purposes it doesn't seem like the most efficient
>> solution" and "So the JSON route is very roughly ~13x slower than
>> using Java directly."
>>
>> I've not benchmarked this and will take a closer look at what you have
>> today -- in fact I may include these details on the JDataFrame page.
>> The JDataFrame targets the use case where there's significant
>> development being done in Java and data is exported into R and,
>> additionally, the developer intends to keep the two separated as much
>> as possible. I could work with Java directly, but then I potentially
>> end up with quite a bit of Java code taking up space in R and I don't
>> like this because if I need to refactor something I have to do it in
>> two places.
>>
>
> No, the code is the same - it makes no difference. The R code is only one 
> call to fetch what you need by calling your Java method. The nice thing is 
> that you in fact save some code: there is no reason to serialize at all, 
> since you can access all Java objects directly.
>
>
>> There's another use case for the JDataFrame as well and that's in an
>> enterprise application (you may have alluded to this when you said
>> "[i]f you need process separation..."). Consider a business where
>> users are working with R and the application that produces the data is
>> actually running in Tomcat. Shipping large amounts of data over the
>> wire in this example would be a performance destroyer, but for small
>> data sets it certainly would be helpful from a development perspective
>> to expose JSON-based web services where the R script would be able to
>> convert a result into a data frame gracefully.
>>
>
> Yes, sure, that makes sense. Like I said, I would probably use some native 
> format in that case if I worried about performance. Some candidates that come 
> to my mind are ProtoBuf and QAP (serialization used by Rserve). If you have 
> arrays, you can always serialize them directly which may be most efficient, 
> but you'd probably have to write the wrapper for that yourself (annoyingly, 
> the default Java methods use big-endian format which is slower on most 
> machines). But then, you're right that for Tomcat applications the sizes are 
> small enough that using JSON has the benefit that you can inspect payload by 
> eye and/or other tools very easily.
>
> Cheers,
> Simon
>
>
>>
>> On Fri, Jan 15, 2016 at 10:58 AM, Simon Urbanek
>> <simon.urba...@r-project.org> wrote:
>>> Tom,
>>>
>>> this may be good for embedding small data sets, but for practical purposes 
>>> it doesn't seem like the most efficient solution.
>>>
>>> Since you didn't provide any code, I built a test case using the built-in 
>>> Java JSON API to build a medium-sized dataset (1e6 rows) and read it in 
>>> just to get a ballpark (see
>>> https://gist.github.com/s-u/4efb284e3c15c6a2db16).
>>>
>>> # generate:
>>> time java -cp .:javax.json-api-1.0.jar:javax.json-1.0.4.jar A > 1e6
>>>
>>> real    0m2.764s
>>> user    0m20.356s
>>> sys     0m0.962s
>>>
>>> # read:
>>>> system.time(temp <- RJSONIO::fromJSON("1e6"))
>>>   user  system elapsed
>>>  3.484   0.279   3.834
>>>> str(temp)
>>> List of 2
>>> $ V1: num [1:1000000] 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 ...
>>> $ V2: chr [1:1000000] "X0" "X1" "X2" "X3" ...
>>>
>>> For comparison using Java directly (includes both generation and reading 
>>> into R):
>>>
>>>> system.time(temp <- lapply(J("A")$direct(), .jevalArray))
>>>   user  system elapsed
>>>  0.962   0.186   0.494
>>>
>>> So the JSON route is very roughly ~13x slower than using Java directly. 
>>> Obviously, this will vary by data set type etc. since there is R overhead 
>>> involved as well: for example, if you have only numeric variables, the JSON 
>>> route is 30x slower on reading alone [50x total]. String variables slow 
>>> down both approaches equally. Interestingly, the JSON encoding uses all 16 
>>> cores, so the 2.7s real time adds up to over 20s of CPU time; on smaller 
>>> machines you may see more overhead.
>>>
>>> If you need process separation, it may be a different story - in principle 
>>> it is faster to use a more native serialization than JSON, since parsing is 
>>> the slowest part for big datasets.
>>>
>>> Cheers,
>>> Simon
>>>
>>>
>>>> On Jan 14, 2016, at 4:52 PM, Thomas Fuller 
>>>> <thomas.ful...@coherentlogic.com> wrote:
>>>>
>>>> Hi Folks,
>>>>
>>>> If you need to send data from Java to R you may consider using the
>>>> JDataFrame API -- which is used to convert data into JSON which then
>>>> can be converted into a data frame in R.
>>>>
>>>> Here's the project page:
>>>>
>>>> https://coherentlogic.com/middleware-development/jdataframe/
>>>>
>>>> and here's a partial example which demonstrates what the API looks like:
>>>>
>>>> String result = new JDataFrameBuilder()
>>>>   .addColumn("Code", new Object[] {"WV", "VA", })
>>>>   .addColumn("Description", new Object[] {"West Virginia", "Virginia"})
>>>>   .toJson();
>>>>
>>>> and in the R script we would need to do this:
>>>>
>>>> temp <- RJSONIO::fromJSON(json)
>>>> tempDF <- as.data.frame(temp)
>>>>
>>>> which yields a data frame that looks like this:
>>>>
>>>>> tempDF
>>>>   Description Code
>>>> 1 West Virginia   WV
>>>> 2      Virginia   VA
>>>>
>>>> It is my intention to deploy this project to Maven Central this week,
>>>> time permitting.
>>>>
>>>> Questions and comments are welcomed.
>>>>
>>>> Tom
>>>>
>>>> ______________________________________________
>>>> R-devel@r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>>
>>>
>>
>
