Github user ash211 commented on the pull request:

    https://github.com/apache/spark/pull/377#issuecomment-43048752
  
    I don't have one, my apologies.  I played around with a generated file like
    the one below to see what I could come up with, but didn't find anything.
    
    ```
    $ perl -e 'print ((("a" x 200) . "\n") x 1000000)' | hadoop fs -put - /tmp/file.txt
    $ hadoop fs -ls /tmp/file.txt
    Found 1 items
    -rw-r--r--   3 user group  201000000 2014-05-13 23:26 /tmp/file.txt
    $ ./bin/spark-shell
    scala> val f = sc.textFile("hdfs:///tmp/file.txt").cache
    14/05/13 23:31:38 INFO storage.MemoryStore: ensureFreeSpace(80202) called with curMem=0, maxMem=206150041
    14/05/13 23:31:38 INFO storage.MemoryStore: Block broadcast_0 stored as values to memory (estimated size 78.3 KB, free 196.5 MB)
    f: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:12
    
    scala> f.count
    14/05/13 23:31:43 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
    14/05/13 23:31:43 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 54497d3f865a6bf89abd843e1d8441f84d844458]
    14/05/13 23:31:43 INFO mapred.FileInputFormat: Total input paths to process : 1
    14/05/13 23:31:43 INFO spark.SparkContext: Starting job: count at <console>:15
    14/05/13 23:31:43 INFO scheduler.DAGScheduler: Got job 0 (count at <console>:15) with 2 output partitions (allowLocal=false)
    14/05/13 23:31:43 INFO scheduler.DAGScheduler: Final stage: Stage 0(count at <console>:15)
    14/05/13 23:31:43 INFO scheduler.DAGScheduler: Parents of final stage: List()
    14/05/13 23:31:43 INFO scheduler.DAGScheduler: Missing parents: List()
    14/05/13 23:31:43 INFO scheduler.DAGScheduler: Submitting Stage 0 (MappedRDD[1] at textFile at <console>:12), which has no missing parents
    14/05/13 23:31:43 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from Stage 0 (MappedRDD[1] at textFile at <console>:12)
    14/05/13 23:31:43 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
    14/05/13 23:31:43 INFO scheduler.TaskSetManager: Starting task 0.0:0 as TID 0 on executor 2: machine06.localdomain (NODE_LOCAL)
    14/05/13 23:31:43 INFO scheduler.TaskSetManager: Serialized task 0.0:0 as 1777 bytes in 2 ms
    14/05/13 23:31:43 INFO scheduler.TaskSetManager: Starting task 0.0:1 as TID 1 on executor 1: machine04.localdomain (NODE_LOCAL)
    14/05/13 23:31:43 INFO scheduler.TaskSetManager: Serialized task 0.0:1 as 1777 bytes in 0 ms
    14/05/13 23:31:46 INFO storage.BlockManagerInfo: Added rdd_1_1 in memory on machine04.localdomain:52794 (size: 225.2 MB, free: 12.0 GB)
    14/05/13 23:31:46 INFO storage.BlockManagerInfo: Added rdd_1_0 in memory on machine06.localdomain:2364 (size: 225.2 MB, free: 12.0 GB)
    14/05/13 23:31:46 INFO scheduler.TaskSetManager: Finished TID 1 in 2214 ms on machine04.localdomain (progress: 1/2)
    14/05/13 23:31:46 INFO scheduler.DAGScheduler: Completed ResultTask(0, 1)
    14/05/13 23:31:46 INFO scheduler.DAGScheduler: Completed ResultTask(0, 0)
    14/05/13 23:31:46 INFO scheduler.TaskSetManager: Finished TID 0 in 2279 ms on machine06.localdomain (progress: 2/2)
    14/05/13 23:31:46 INFO scheduler.DAGScheduler: Stage 0 (count at <console>:15) finished in 2.289 s
    14/05/13 23:31:46 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
    14/05/13 23:31:46 INFO spark.SparkContext: Job finished: count at <console>:15, took 2.351118993 s
    res0: Long = 1000000
    
    scala>
    ```
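
    As a rough sanity check on those numbers (back-of-the-envelope arithmetic
    only, not a Spark test): the file is 1,000,000 lines of 200 `a` characters
    plus a newline, i.e. 201,000,000 bytes on disk, while the two cached
    partitions total 2 x 225.2 MB, about a 2.35x blow-up — in the ballpark you'd
    expect when text is cached as deserialized `java.lang.String` objects
    (2 bytes per char in the UTF-16 backing array, plus per-object overhead),
    so nothing obviously wrong with the SizeEstimator here.

    ```python
    # Back-of-the-envelope check of the numbers in the transcript above.
    lines = 1_000_000
    bytes_per_line = 200 + 1             # 200 'a' chars + '\n'
    on_disk = lines * bytes_per_line     # matches the 201000000 from `hadoop fs -ls`

    cached = 2 * 225.2 * 1024 ** 2       # two rdd_1_* blocks of 225.2 MB each
    ratio = cached / on_disk

    print(on_disk)          # 201000000
    print(round(ratio, 2))  # ~2.35x the on-disk size once cached as Strings
    ```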
    
    
    On Tue, May 13, 2014 at 9:13 PM, Shivaram Venkataraman <
    [email protected]> wrote:
    
    > @ash211 <https://github.com/ash211> @pwendell <https://github.com/pwendell>
    > Do you have a simple test case to show textFile memory usage being greater
    > than expected? There might have been some JVM changes which affect the
    > SizeEstimator, but it's easier to debug if we have a simple test case.
    >
    > —
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/377#issuecomment-43040456>.
    >

