Re: Belief propagation algorithm is open sourced

2016-12-15 Thread Bertrand Dechoux
Nice! I am especially interested in Bayesian networks, which are only one
of the many models that can be expressed with a factor graph representation.
Do you do Bayesian network learning at scale (parameters and structure)
with latent variables? Are you using publicly available tools for that?
Which ones?

LibDAI, which defined the supported file format, "supports parameter learning
of conditional probability tables by Expectation Maximization" according to
its documentation. Is it your reference tool?

Bertrand

On Thu, Dec 15, 2016 at 5:21 AM, Bryan Cutler  wrote:

> I'll check it out, thanks for sharing Alexander!
>
> On Dec 13, 2016 4:58 PM, "Ulanov, Alexander" 
> wrote:
>
>> Dear Spark developers and users,
>>
>>
>> HPE has open sourced an implementation of the belief propagation (BP)
>> algorithm for Apache Spark. BP is a popular message-passing algorithm for
>> performing inference in probabilistic graphical models. It provides exact
>> inference for graphical models without loops; for graphical models with
>> loops the inference is approximate, but in practice it has been shown to
>> work well. The implementation is generic and operates on the factor graph
>> representation of graphical models. It handles factors of any order and
>> variable domains of any size. It is implemented with Apache Spark GraphX
>> and can therefore scale to large models. Further, it supports computations
>> in log scale for numerical stability. Large-scale applications of BP
>> include fraud detection in banking transactions and malicious site
>> detection in computer networks.
>>
>>
>> Source code: https://github.com/HewlettPackard/sandpiper
>>
>>
>> Best regards, Alexander
>>
>
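A side note on the announcement's "computations in log scale" point: belief propagation multiplies many small factor values, which underflows quickly, so messages are typically kept as log-values and normalised with a log-sum-exp. A minimal, generic Scala sketch of that trick (this is not the sandpiper API):

    // Hypothetical helper illustrating log-scale message combination in BP.
    // Products of messages become sums of log-messages; marginalisation
    // (summing probabilities) becomes a log-sum-exp, which avoids underflow.
    object LogScale {
      // log(sum_i exp(x_i)) computed stably by factoring out the maximum.
      def logSumExp(xs: Array[Double]): Double = {
        val m = xs.max
        if (m.isNegInfinity) Double.NegativeInfinity
        else m + math.log(xs.map(x => math.exp(x - m)).sum)
      }

      // Element-wise product of messages in probability space is an
      // element-wise sum in log space; renormalise so the result is a
      // proper (log-)distribution.
      def combine(logMessages: Seq[Array[Double]]): Array[Double] = {
        val combined = logMessages.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
        val z = logSumExp(combined)
        combined.map(_ - z)
      }
    }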


SparkContext#cancelJobGroup: is it safe? Who got burned? Who is alive?

2016-06-14 Thread Bertrand Dechoux
Hi,

I am wondering about the safety of the *SparkContext#cancelJobGroup* method,
which should allow one to stop specific (i.e. not all) jobs inside a Spark context.

There is a big disclaimer in the javadoc of setJobGroup:
https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/SparkContext.html#setJobGroup(java.lang.String,%20java.lang.String,%20boolean)

If interruptOnCancel is set to true for the job group, then job
> cancellation will result in Thread.interrupt() being called on the job's
> executor threads. This is useful to help ensure that the tasks are actually
> stopped in a timely manner, but is off by default due to HDFS-1208, where
> HDFS may respond to Thread.interrupt() by marking nodes as dead.
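
For reference, a minimal usage sketch of the API in question (the group id, the artificial sleep and the separate submitting thread are only illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    object CancelGroupSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("cancel-group-sketch").setMaster("local[2]"))

        val worker = new Thread {
          override def run(): Unit = {
            // Job group properties are thread-local: set them in the thread
            // that submits the job.
            sc.setJobGroup("slow-group", "artificially slow count", interruptOnCancel = true)
            try {
              sc.parallelize(1 to 100000, 100)
                .map { i => Thread.sleep(10); i } // slow the tasks down on purpose
                .count()
            } catch {
              case e: Exception => println(s"job group cancelled: ${e.getMessage}")
            }
          }
        }
        worker.start()

        Thread.sleep(5000)
        sc.cancelJobGroup("slow-group") // cancels only the jobs tagged with this group id
        worker.join()
        sc.stop()
      }
    }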


I have two main questions:

   1. What is the expected behavior if it is not interrupted on cancel? I
   am especially curious about the YARN case with HDFS but any info is welcome.
   2. Who is or was using *interruptOnCancel*? Did you get burned? Is it
   still working without any incident?

Thanks in advance for any info, feedback and war stories.

Bertrand Dechoux


Re: Replacing Esper with Spark Streaming?

2015-09-15 Thread Bertrand Dechoux
The big question would be which features of Esper you are using. Esper is a
CEP solution. I doubt that Spark Streaming can do everything Esper does
without any custom development; Spark (Streaming) is more of a
general-purpose platform.

http://www.espertech.com/products/esper.php

But I would be glad to be proven wrong (which would also imply that EsperTech
is dead, which I also doubt...).
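
To make that concrete, the kind of rule Esper expresses declaratively, e.g. "alert when a user has 5 or more login failures within 30 seconds", has to be hand-rolled from generic primitives in Spark Streaming. A rough sketch (the event format and threshold are made up):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object FailedLoginRule {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("cep-ish-sketch").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(1))

        // Events arrive as "userId,eventType" lines on a socket (made-up format).
        val events = ssc.socketTextStream("localhost", 9999)

        val failuresPerUser = events
          .map(_.split(","))
          .filter(fields => fields.length == 2 && fields(1) == "login_failure")
          .map(fields => (fields(0), 1))
          .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(5))

        // The "pattern": 5 or more failures for the same user within the window.
        failuresPerUser.filter(_._2 >= 5).print()

        ssc.start()
        ssc.awaitTermination()
      }
    }

More involved patterns (event sequences, absence detection, time-constrained joins) would need custom development on top of these primitives.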

Bertrand

On Mon, Sep 14, 2015 at 2:31 PM, Todd Nist  wrote:

> Stratio offers a CEP implementation based on Spark Streaming and the
> Siddhi CEP engine.  I have not used the below, but they may be of some
> value to you:
>
> http://stratio.github.io/streaming-cep-engine/
>
> https://github.com/Stratio/streaming-cep-engine
>
> HTH.
>
> -Todd
>
> On Sun, Sep 13, 2015 at 7:49 PM, Otis Gospodnetić <
> otis.gospodne...@gmail.com> wrote:
>
>> Hi,
>>
>> I'm wondering if anyone has attempted to replace Esper with Spark
>> Streaming or if anyone thinks Spark Streaming is/isn't a good tool for the
>> (CEP) job?
>>
>> We are considering Akka or Spark Streaming as possible Esper replacements
>> and would appreciate any input from people who tried to do that with either
>> of them.
>>
>> Thanks,
>> Otis
>> --
>> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
>> Solr & Elasticsearch Support * http://sematext.com/
>>
>>
>


Re: EOFException when I list all files in hdfs directory

2014-07-25 Thread Bertrand Dechoux
Well, anyone can open an account on the Apache JIRA and post a new
ticket/enhancement/issue/bug...

Bertrand Dechoux


On Fri, Jul 25, 2014 at 4:07 PM, Sparky gullo_tho...@bah.com wrote:

 Thanks for the suggestion. I can confirm that my problem is that I have files
 with zero bytes. It's a known bug and is marked as high priority:

 https://issues.apache.org/jira/browse/SPARK-1960



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/EOFException-when-I-list-all-files-in-hdfs-directory-tp10648p10651.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
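
As a side note, one possible workaround while that bug is open is to skip the empty files before handing the paths to Spark. A rough sketch (the directory path is illustrative and an existing SparkContext `sc` is assumed):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // List the directory ourselves and only pass non-empty files to Spark.
    val dir = new Path("hdfs:///user/someone/input")
    val fs = FileSystem.get(dir.toUri, new Configuration())
    val nonEmptyFiles = fs.listStatus(dir)
      .filter(_.getLen > 0)          // directories and zero-byte files report length 0
      .map(_.getPath.toString)

    // textFile accepts a comma-separated list of paths.
    val rdd = sc.textFile(nonEmptyFiles.mkString(","))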



Re: Spark deployed by Cloudera Manager

2014-07-23 Thread Bertrand Dechoux
 Is there any documentation from Cloudera on how to run Spark apps on CDH
Manager-deployed Spark?

Asking the Cloudera community would be a good idea:
http://community.cloudera.com/

In the end, only Cloudera will quickly fix issues with CDH...

Bertrand Dechoux


On Wed, Jul 23, 2014 at 9:28 AM, Debasish Das debasish.da...@gmail.com
wrote:

 I found the issue...

 If you use the Spark git repository and generate the assembly jar, then
 org.apache.hadoop.io.Writable.class is packaged with it.

 If you use the assembly jar that ships with CDH in
 /opt/cloudera/parcels/CDH/lib/spark/assembly/lib/spark-assembly_2.10-0.9.0-cdh5.0.2-hadoop2.3.0-cdh5.0.2.jar,
 they don't put org.apache.hadoop.io.Writable.class in it...

 That's weird...

 If I can run the Spark app using bare-bone Java, I am sure it will run with
 Ooyala's job server as well...



 On Wed, Jul 23, 2014 at 12:15 AM, buntu buntu...@gmail.com wrote:

 If you need to run Spark apps through Hue, see if Ooyala's job server
 helps:



 http://gethue.com/get-started-with-spark-deploy-spark-server-and-compute-pi-from-your-web-browser/



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-deployed-by-Cloudera-Manager-tp10472p10474.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.





Re: Large scale ranked recommendation

2014-07-18 Thread Bertrand Dechoux
And you might want to apply clustering beforehand: it is likely that not every
user and every item is truly unique.

Bertrand Dechoux


On Fri, Jul 18, 2014 at 9:13 AM, Nick Pentreath nick.pentre...@gmail.com
wrote:

 It is very true that making predictions in batch for all 1 million users
 against the 10k items will be quite onerous in terms of computation. I have
 run into this issue too in making batch predictions.

 Some ideas:

 1. Do you really need to generate recommendations for each user in batch?
 How are you serving these recommendations? In most cases, you only need to
 make recs when a user is actively interacting with your service or product
 etc. Doing it all in batch tends to be a big waste of computation resources.

 In our system for example we are serving them in real time (as a user
 arrives at a web page, say, our customer hits our API for recs), so we only
 generate the rec at that time. You can take a look at Oryx for this (
 https://github.com/cloudera/oryx) though it does not yet support Spark,
 you may be able to save the model into the correct format in HDFS and have
 Oryx read the data.

 2. If you do need to make the recs in batch, then I would suggest:
 (a) because you have few items, I would collect the item vectors and form
 a matrix.
 (b) broadcast that matrix
 (c) do a mapPartitions on the user vectors. Form a user matrix from the
 vectors in each partition (maybe create quite a few partitions to make each
 user matrix not too big)
 (d) do a value call on the broadcasted item matrix
 (e) now for each partition you have the (small) item matrix and a (larger)
 user matrix. Do a matrix multiply and you end up with a (U x I) matrix with
 the scores for each user in the partition. Because you are using BLAS here,
 it will be significantly faster than individually computed dot products
 (f) sort the scores for each user and take top K
 (g) save or collect and do whatever with the scores

 3. in conjunction with (2) you can try throwing more resources at the
 problem too

 If you access the underlying Breeze vectors (I think the toBreeze method
 is private so you may have to re-implement it), you can do all this using
 Breeze (e.g. concatenating vectors to make matrices, iterating and whatnot).

 Hope that helps

 Nick
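
For what it is worth, a rough Breeze-based sketch of the batch recipe in point 2 above (the names batchTopK, userFactors and itemFactors are illustrative, and error handling is omitted):

    import breeze.linalg.DenseMatrix
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    object BatchTopK {
      // userFactors: (userId, factor vector); itemFactors: already collected
      // to the driver. `rank` is the factor dimension.
      def batchTopK(sc: SparkContext,
                    userFactors: RDD[(Int, Array[Double])],
                    itemFactors: Array[(Int, Array[Double])],
                    k: Int): RDD[(Int, Seq[(Int, Double)])] = {

        val rank = itemFactors.head._2.length
        val itemIds = itemFactors.map(_._1)
        // (a) item matrix (rank x numItems), column-major, one column per item
        val itemMat = new DenseMatrix(rank, itemFactors.length, itemFactors.flatMap(_._2))
        // (b) broadcast it once
        val itemsBc = sc.broadcast((itemIds, itemMat))

        // (c)-(e) per partition: build a user matrix and do one matrix multiply
        userFactors.mapPartitions { iter =>
          val (ids, items) = itemsBc.value
          val users = iter.toArray
          if (users.isEmpty) Iterator.empty
          else {
            // (numUsersInPartition x rank): transpose of a column-major matrix
            val userMat = new DenseMatrix(rank, users.length, users.flatMap(_._2)).t
            val scores = userMat * items // (numUsers x numItems), BLAS under the hood
            // (f) top K per user
            users.indices.iterator.map { i =>
              val scored = ids.indices.map(j => (ids(j), scores(i, j)))
              (users(i)._1, scored.sortBy { case (_, s) => -s }.take(k).toSeq)
            }
          }
        }
      }
    }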


 On Fri, Jul 18, 2014 at 1:17 AM, m3.sharma sharm...@umn.edu wrote:

 Yes, that's what prediction should be doing: taking dot products or applying a
 sigmoid function for each (user, item) pair. For 1 million users and 10K items
 there are 10 billion pairs.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Large-scale-ranked-recommendation-tp10098p10107.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.





Re: How does Spark speculation prevent duplicated work?

2014-07-15 Thread Bertrand Dechoux
I haven't looked at the implementation, but what you would do with any
filesystem is write to a file inside the workspace directory of the task.
Then only the attempt of the task that should be kept performs a
move to the final path; the other attempts are simply discarded. For most
filesystems (and that's the case for HDFS), a 'move' is a very simple and
fast operation, because only the full path/name of the file changes, not
its content or where this content is physically stored.

Speculative execution happens in Hadoop MapReduce, and Spark has the same
concept. As long as you apply functions with no side effects (i.e. the only
impact is the returned result), then you just need to not take into
account results from additional attempts of the same task/operator.
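
Roughly, the pattern looks like this (only a sketch of the idea, not Spark's actual committer code; the paths and the attemptId parameter are made up):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object AttemptCommitSketch {
      // Each attempt writes to its own temporary path; only the attempt that
      // is kept "commits" by renaming its file to the final path.
      def writeTaskOutput(conf: Configuration, lines: Iterator[String],
                          finalPath: Path, attemptId: String): Unit = {
        val fs = FileSystem.get(finalPath.toUri, conf)
        val attemptPath = new Path(finalPath.getParent,
          s"_temporary/$attemptId/${finalPath.getName}")

        val out = fs.create(attemptPath, true)
        try lines.foreach(line => out.write((line + "\n").getBytes("UTF-8")))
        finally out.close()

        // The rename is cheap on HDFS: only metadata changes, not the blocks.
        if (!fs.rename(attemptPath, finalPath)) {
          // A losing duplicate attempt is simply discarded.
          fs.delete(attemptPath, false)
        }
      }
    }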

Bertrand Dechoux


On Tue, Jul 15, 2014 at 9:34 PM, Andrew Ash and...@andrewash.com wrote:

 Hi Nan,

 Great digging in -- that makes sense to me for when a job is producing
 some output handled by Spark like a .count or .distinct or similar.

 For the other part of the question, I'm also interested in side effects
 like an HDFS disk write.  If one task is writing to an HDFS path and
 another task starts up, wouldn't it also attempt to write to the same path?
  How is that de-conflicted?


 On Tue, Jul 15, 2014 at 3:02 PM, Nan Zhu zhunanmcg...@gmail.com wrote:

  Hi, Mingyuan,

 According to my understanding,

 Spark processes the result generated from each partition by passing it
 to resultHandler (SparkContext.scala L1056)

 This resultHandler usually just puts the result in a driver-side array,
 the length of which is always partitions.size

 this design effectively ensures that the actions are idempotent

 e.g. the count is implemented as

 def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum

 even if the task for the partition is executed twice, the result put in
 the array is the same



 At the same time, I think the Spark implementation ensures that the
 operation applied to the return value of SparkContext.runJob will not be
 triggered again when the duplicate tasks finish

 Because,


 when a task is finished, the code execution path is
 TaskSetManager.handleSuccessfulTask -> DAGScheduler.taskEnded

 in taskEnded, it will trigger the CompletionEvent message handler, where the
 DAGScheduler checks if (!job.finished(rt.outputId)), and rt.outputId is the
 partition id

 so even if a duplicate task triggers a CompletionEvent message, it will
 find that job.finished(rt.outputId) is already true


 Maybe I am wrong… I just went through the code roughly; feel free to correct
 me

 Best,


 --
 Nan Zhu

 On Tuesday, July 15, 2014 at 1:55 PM, Mingyu Kim wrote:

 Hi all,

 I was curious about the details of Spark speculation. So, my
 understanding is that, when “speculated” tasks are newly scheduled on other
 machines, the original tasks are still running until the entire stage
 completes. This seems to leave some room for duplicated work because some
 spark actions are not idempotent. For example, it may be counting a
 partition twice in case of RDD.count or may be writing a partition to HDFS
 twice in case of RDD.save*(). How does it prevent this kind of duplicated
 work?

 Mingyu







Re: Does the MLlib Naive Bayes implementation incorporate Laplace smoothing?

2014-07-10 Thread Bertrand Dechoux
A patch proposal on the Apache JIRA for Spark?
https://issues.apache.org/jira/browse/SPARK/

Bertrand


On Thu, Jul 10, 2014 at 2:37 PM, Rahul Bhojwani rahulbhojwani2...@gmail.com
 wrote:

 And also, there is a small bug in the implementation, as I mentioned
 earlier.

 This is the first time I am reporting a bug, so I just wanted to ask
 whether your name is credited somewhere or you get something after
 correcting/reporting a bug, so that I can mention it in my
 profile/resume (as I am moving towards my final year undergrad and will be
 sitting for company interviews). If there is something like that which I
 can mention, then I can go through the procedure you mentioned earlier.

 Otherwise I can simply mention the error here on the mailing list itself.
 It's a small bug (a bug according to me).

 I am not trying to be selfish. It's just that if I get something that can
 help make my profile look strong then I shouldn't miss it at this stage.

 Thanks,



 On Thu, Jul 10, 2014 at 5:54 PM, Rahul Bhojwani 
 rahulbhojwani2...@gmail.com wrote:

 Ya thanks. I can see that lambda is used as the parameter.



 On Thu, Jul 10, 2014 at 1:40 PM, Sean Owen so...@cloudera.com wrote:

 Have a look at the code maybe?

 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala

 Yes, there is a smoothing parameter, and yes, from the looks of it, it is
 simply additive / Laplace smoothing. It's been in there for a while.
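
For reference, a minimal sketch of using that parameter (an existing SparkContext `sc` is assumed and the tiny training set is only illustrative):

    import org.apache.spark.mllib.classification.NaiveBayes
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // The second argument of train() is the additive (Laplace) smoothing parameter.
    val training = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.dense(1.0, 0.0)),
      LabeledPoint(1.0, Vectors.dense(0.0, 1.0))
    ))
    val model = NaiveBayes.train(training, lambda = 1.0)
    println(model.predict(Vectors.dense(0.0, 1.0))) // predicts class 1.0 here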


 On Thu, Jul 10, 2014 at 6:55 AM, Rahul Bhojwani 
 rahulbhojwani2...@gmail.com wrote:

 The discussion is in the context of Spark 0.9.1.
 Does the MLlib Naive Bayes implementation incorporate Laplace smoothing?
 Or any other smoothing? Or does it not incorporate any smoothing at all?
 Please let me know.

 Thanks,
 --
 Rahul K Bhojwani
 3rd Year B.Tech
 Computer Science and Engineering
 National Institute of Technology, Karnataka





 --
 Rahul K Bhojwani
 3rd Year B.Tech
 Computer Science and Engineering
 National Institute of Technology, Karnataka




 --
 Rahul K Bhojwani
 3rd Year B.Tech
 Computer Science and Engineering
 National Institute of Technology, Karnataka



Pig 0.13, Spark, Spork

2014-07-07 Thread Bertrand Dechoux
Hi,

I was wondering what the state of the Pig+Spark initiative is now that the
execution engine of Pig is pluggable. Granted, it was done in order to use
Tez, but could it be used by Spark? I know about a 'theoretical' project
called Spork, but I don't know of any stable and maintained version of it.

Regards

Bertrand Dechoux


Re: Shark vs Impala

2014-06-22 Thread Bertrand Dechoux
For the second question, I would say it is mainly because the projects do not
have the same aim. Impala does have a cost-based optimizer and
predicate-propagation capability, which is natural because it is interpreting
pseudo-SQL queries. In the realm of relational databases, it is often not a
good idea to compete against the optimizer, and that is of course also true
for 'Big Data'.

Bertrand


On Sun, Jun 22, 2014 at 1:32 PM, Flavio Pompermaier pomperma...@okkam.it
wrote:

 Hi folks,
 I was looking at the benchmark provided by Cloudera at
 http://blog.cloudera.com/blog/2014/05/new-sql-choices-in-the-apache-hadoop-ecosystem-why-impala-continues-to-lead/
 .
 Is it true that Shark cannot execute some queries if you don't have enough
 memory?
 And is it true/reliable that Impala outperforms Spark by so much when
 executing complex queries?

 Best,
 Flavio



Re: Is There Any Benchmarks Comparing C++ MPI with Spark

2014-06-16 Thread Bertrand Dechoux
I guess you have to understand the difference in architecture. I don't know
much about C++ MPI, but it is basically MPI, whereas Spark is inspired by
Hadoop MapReduce and optimised for reading/writing large amounts of data
with a smart caching and locality strategy. Intuitively, if you have a high
CPU/message ratio then MPI might be better. But what that ratio is, is hard
to say, and in the end it will depend on your specific application.
Finally, in real life, this difference in performance due to the
architecture may not be the only or the most important factor in the choice,
as Michael already explained.

Bertrand

On Mon, Jun 16, 2014 at 1:23 PM, Michael Cutler mich...@tumra.com wrote:

 Hello Wei,

 I talk from experience of writing many HPC distributed applications using
 Open MPI (C/C++) on x86, PowerPC and Cell B.E. processors, and Parallel
 Virtual Machine (PVM) way before that back in the 90's.  I can say with
 absolute certainty:

 *Any gains you believe there are because C++ is faster than Java/Scala
 will be completely blown by the inordinate amount of time you spend
 debugging your code and/or reinventing the wheel to do even basic tasks
 like linear regression.*


 There are undoubtedly some very specialised use-cases where MPI and its
 brethren still dominate for High Performance Computing tasks -- like for
 example the nuclear decay simulations run by the US Department of Energy on
 supercomputers where they've invested billions solving that use case.

 Spark is part of the wider Big Data ecosystem, and its biggest
 advantages are traction amongst internet scale companies, hundreds of
 developers contributing to it and a community of thousands using it.

 Need a distributed fault-tolerant file system? Use HDFS.  Need a
 distributed/fault-tolerant message-queue? Use Kafka.  Need to co-ordinate
 between your worker processes? Use Zookeeper.  Need to run it on a flexible
 grid of computing resources and handle failures? Run it on Mesos!

 The barrier to entry to get going with Spark is very low, download the
 latest distribution and start the Spark shell.  Language bindings for Scala
 / Java / Python are excellent meaning you spend less time writing
 boilerplate code, and more time solving problems.

 Even if you believe you *need* to use native code to do something
 specific, like fetching HD video frames from satellite video capture cards
 -- wrap it in a small native library and use the Java Native Access
 interface to call it from your Java/Scala code.

 Have fun, and if you get stuck we're here to help!

 MC


 On 16 June 2014 08:17, Wei Da xwd0...@gmail.com wrote:

 Hi guys,
 We are making choices between C++ MPI and Spark. Is there any official
 comparation between them? Thanks a lot!

 Wei





Re: Hadoop 2.3 Centralized Cache vs RDD

2014-05-16 Thread Bertrand Dechoux
http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html

"We do not currently cache blocks which are under construction, corrupt, or
otherwise incomplete."

Have you tried with a file with more than 1 block?

And dfs.namenode.path.based.cache.refresh.interval.ms might be too large?

You might want to ask a broader mailing list. This is not related to Spark.

Bertrand


On Fri, May 16, 2014 at 2:56 AM, hequn cheng chenghe...@gmail.com wrote:

 I tried the centralized cache step by step following the Apache Hadoop
 official website, but it seems the centralized cache doesn't work.
 See:
 http://stackoverflow.com/questions/22293358/centralized-cache-failed-in-hadoop-2-3
 Has anyone succeeded?


 2014-05-15 5:30 GMT+08:00 William Kang weliam.cl...@gmail.com:

 Hi,
 Any comments or thoughts on the implications of the newly released
 Hadoop 2.3 centralized cache feature? How different is it from an
 RDD?

 Many thanks.


 Cao





Re: Real world

2014-05-16 Thread Bertrand Dechoux
http://spark-summit.org ?

Bertrand


On Thu, May 8, 2014 at 2:05 AM, Ian Ferreira ianferre...@hotmail.comwrote:

 Folks,

 I keep getting questioned on real world experience of Spark as in mission
 critical production deployments. Does anyone have some war stories to share
 or know of resources to review?

 Cheers
 - Ian



Re: PySpark still reading only text?

2014-04-22 Thread Bertrand Dechoux
Cool, thanks for the link.

Bertrand Dechoux


On Mon, Apr 21, 2014 at 7:31 PM, Nick Pentreath nick.pentre...@gmail.comwrote:

 Also see: https://github.com/apache/spark/pull/455

 This will add support for reading SequenceFiles and other InputFormats in
 PySpark, as long as the Writables are either simple (primitives, maps and
 arrays of same), or reasonably simple Java objects.

 I'm about to push a change from MsgPack to Pyrolite for the serialization.

 Support for saving as SequenceFile or other InputFormats could then also come
 after that. It would be based on saving the picklable Python objects as
 SequenceFiles and being able to read those back.
 —
 Sent from Mailbox https://www.dropbox.com/mailbox for iPhone


 On Thu, Apr 17, 2014 at 11:40 AM, Bertrand Dechoux decho...@gmail.comwrote:

 According to the Spark SQL documentation, indeed, this project allows
 Python to be used while reading/writing tables, i.e. data which is not
 necessarily in text format.

 Thanks a lot!

 Bertrand Dechoux


 On Thu, Apr 17, 2014 at 10:06 AM, Bertrand Dechoux decho...@gmail.comwrote:

 Thanks for the JIRA reference. I really need to look at Spark SQL.

 Am I right to understand that, thanks to Spark SQL, Hive data can be read
 (and it does not need to be in a text format) and then 'classical' Spark can
 work on this extraction?

 It seems logical but I haven't worked with Spark SQL as of now.

 Does it also imply the reverse is true? That I can write data as Hive
 data with Spark SQL using results from an arbitrary (Python) Spark application?

 Bertrand Dechoux


 On Thu, Apr 17, 2014 at 7:23 AM, Matei Zaharia 
 matei.zaha...@gmail.comwrote:

 Yes, this JIRA would enable that. The Hive support also handles HDFS.

  Matei

 On Apr 16, 2014, at 9:55 PM, Jesvin Jose frank.einst...@gmail.com
 wrote:

 When this is implemented, can you load/save an RDD of pickled objects
 to HDFS?


 On Thu, Apr 17, 2014 at 1:51 AM, Matei Zaharia matei.zaha...@gmail.com
  wrote:

 Hi Bertrand,

 We should probably add a SparkContext.pickleFile and
 RDD.saveAsPickleFile that will allow saving pickled objects. Unfortunately
 this is not in yet, but there is an issue up to track it:
 https://issues.apache.org/jira/browse/SPARK-1161.

 In 1.0, one feature we do have now is the ability to load binary data
 from Hive using Spark SQL’s Python API. Later we will also be able to save
 to Hive.

 Matei

 On Apr 16, 2014, at 4:27 AM, Bertrand Dechoux decho...@gmail.com
 wrote:

  Hi,
 
  I have browsed the online documentation and it is stated that
 PySpark only read text files as sources. Is it still the case?
 
  From what I understand, the RDD can after this first step be any
 serialized python structure if the class definitions are well distributed.
 
  Is it not possible to read back those RDDs? That is create a flow to
 parse everything and then, e.g. the next week, start from the binary,
 structured data?
 
  Technically, what is the difficulty? I would assume the code reading
 a binary python RDD or a binary python file to be quite similar. Where can
 I know more about this subject?
 
  Thanks in advance
 
  Bertrand




 --
 We dont beat the reaper by living longer. We beat the reaper by living
 well and living fully. The reaper will come for all of us. Question is,
 what do we do between the time we are born and the time he shows up? -Randy
 Pausch








PySpark still reading only text?

2014-04-16 Thread Bertrand Dechoux
Hi,

I have browsed the online documentation and it is stated that PySpark only
reads text files as sources. Is that still the case?

From what I understand, the RDD can, after this first step, be any serialized
Python structure if the class definitions are well distributed.

Is it not possible to read back those RDDs? That is, create a flow to parse
everything and then, e.g. the next week, start from the binary, structured
data?

Technically, what is the difficulty? I would assume the code reading a
binary Python RDD and the code reading a binary Python file to be quite
similar. Where can I learn more about this subject?

Thanks in advance

Bertrand


Re: Hadoop Input Format - newAPIHadoopFile

2014-03-19 Thread Bertrand Dechoux
I don't know about a Spark-specific issue, but the Hadoop context is clear.

old API -> org.apache.hadoop.mapred
new API -> org.apache.hadoop.mapreduce

You might only need to change your import.
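
Concretely, keeping newAPIHadoopFile, the snippet quoted below would become something like this (a sketch with your HDFS path):

    import org.apache.hadoop.io.{LongWritable, Text}
    // new API: TextInputFormat comes from org.apache.hadoop.mapreduce.lib.input
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    val file2 = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](
      "hdfs://192.168.100.130:8020/user/hue/pig/examples/data/sonnets.txt")
    val lines = file2.map { case (_, text) => text.toString }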

Regards

Bertrand


On Wed, Mar 19, 2014 at 11:29 AM, Pariksheet Barapatre pbarapa...@gmail.com
 wrote:

 Hi,

 Trying to read HDFS file with TextInputFormat.

 scala> import org.apache.hadoop.mapred.TextInputFormat
 scala> import org.apache.hadoop.io.{LongWritable, Text}
 scala> val file2 =
 sc.newAPIHadoopFile[LongWritable,Text,TextInputFormat]("hdfs://
 192.168.100.130:8020/user/hue/pig/examples/data/sonnets.txt")


 This is giving me the error.

 <console>:14: error: type arguments
 [org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.Text,org.apache.hadoop.mapred.TextInputFormat]
 conform to the bounds of none of the overloaded alternatives of
  value newAPIHadoopFile: [K, V, F <:
 org.apache.hadoop.mapreduce.InputFormat[K,V]](path: String, fClass:
 Class[F], kClass: Class[K], vClass: Class[V], conf:
 org.apache.hadoop.conf.Configuration)org.apache.spark.rdd.RDD[(K, V)] <and>
 [K, V, F <: org.apache.hadoop.mapreduce.InputFormat[K,V]](path:
 String)(implicit km: scala.reflect.ClassTag[K], implicit vm:
 scala.reflect.ClassTag[V], implicit fm:
 scala.reflect.ClassTag[F])org.apache.spark.rdd.RDD[(K, V)]
    val file2 =
 sc.newAPIHadoopFile[LongWritable,Text,TextInputFormat]("hdfs://
 192.168.100.130:8020/user/hue/pig/examples/data/sonnets.txt")


 What is the correct syntax if I want to use TextInputFormat?

 Also, how do I use a custom InputFormat? Very silly question, but I am not sure
 how and where to keep the jar file containing the custom InputFormat class.

 Thanks
 Pariksheet



 --
 Cheers,
 Pari