Following the discussion and recommendations by the link you provided, we ran tests with disabled constraint propagation, using the following option: spark.conf.set(SQLConf.CONSTRAINT_PROPAGATION_ENABLED.key, false) The resulting measurements are on the plot: <http://apache-spark-developers-list.1001551.n3.nabble.com/file/t3091/disable_constraint_propagation_comparison.jpg> Unfortunately this does not provide any difference for the described test jobs. To our understanding, this is because our test jobs do not include any iterations or complex transformations (just read file and count). For the same reason, it is not clear how checkpointing or caching could be applied here.
The transformation plan is trivial, and it seems not being changed by the Catalyst optimizer at different optimization stages. Here is the output of DataFrame.explain(): == Parsed Logical Plan == Relation[key#10,id#11,version#12,Date#13,value#14,address#15,col7_date#16,col8_float#17,col9_boolean#18,col10_string#19,col11_int#20,col12_date#21,col13_float#22,col14_boolean#23,col15_string#24,col16_int#25,col17_date#26,col18_float#27,col19_boolean#28,col20_string#29,col21_int#30,col22_date#31,col23_float#32,col24_boolean#33,col25_string#34,col26_int#35,col27_date#36,col28_float#37,col29_boolean#38,col30_string#39,col31_int#40,col32_date#41,col33_float#42,col34_boolean#43,col35_string#44,col36_int#45,col37_date#46,col38_float#47,col39_boolean#48,col40_string#49,col41_int#50,col42_date#51,col43_float#52,col44_boolean#53,col45_string#54,col46_int#55,col47_date#56,col48_float#57,col49_boolean#58,col50_string#59,col51_int#60,col52_date#61,col53_float#62,col54_boolean#63,col55_string#64,col56_int#65,col57_date#66,col58_float#67,col59_boolean#68,col60_string#69,col61_int#70,col62_date#71,col63_float#72,col64_boolean#73,col65_string#74,col66_int#75,col67_date#76,col68_float#77,col69_boolean#78,col70_string#79,col71_int#80,col72_date#81,col73_float#http://apache-spark-developers-list.1001551.n3.nabble.com/file/t3091/readFormat_visualVM_sampler.jpg82,col74_boolean#83,col75_string#84,col76_int#85,col77_date#86,col78_float#87,col79_boolean#88,col80_string#89,col81_int#90,col82_date#91,col83_float#92,col84_boolean#93,col85_string#94,col86_int#95,col87_date#96,col88_float#97,col89_boolean#98,col90_string#99,col91_int#100,col92_date#101,col93_float#102,col94_boolean#103,col95_string#104,col96_int#105,col97_date#106,col98_float#107,col99_boolean#108,col100_string#109] csv == Analyzed Logical Plan == key: string, id: string, version: string, Date: string, value: string, address: string, col7_date: string, col8_float: string, col9_boolean: string, col10_string: string, col11_int: string, col12_date: string, col13_float: string, col14_boolean: string, col15_string: string, col16_int: string, col17_date: string, col18_float: string, col19_boolean: string, col20_string: string, col21_int: string, col22_date: string, col23_float: string, col24_boolean: string, col25_string: string, col26_int: string, col27_date: string, col28_float: string, col29_boolean: string, col30_string: string, col31_int: string, col32_date: string, col33_float: string, col34_boolean: string, col35_string: string, col36_int: string, col37_date: string, col38_float: string, col39_boolean: string, col40_string: string, col41_int: string, col42_date: string, col43_float: string, col44_boolean: string, col45_string: string, col46_int: string, col47_date: string, col48_float: string, col49_boolean: string, col50_string: string, col51_int: string, col52_date: string, col53_float: string, col54_boolean: string, col55_string: string, col56_int: string, col57_date: string, col58_float: string, col59_boolean: string, col60_string: string, col61_int: string, col62_date: string, col63_float: string, col64_boolean: string, col65_string: string, col66_int: string, col67_date: string, col68_float: string, col69_boolean: string, col70_string: string, col71_int: string, col72_date: string, col73_float: string, col74_boolean: string, col75_string: string, col76_int: string, col77_date: string, col78_float: string, col79_boolean: string, col80_string: string, col81_int: string, col82_date: string, col83_float: string, col84_boolean: string, col85_string: string, col86_int: string, col87_date: string, col88_float: string, col89_boolean: string, col90_string: string, col91_int: string, col92_date: string, col93_float: string, col94_boolean: string, col95_string: string, col96_int: string, col97_date: string, col98_float: string, col99_boolean: string, col100_string: string Relation[key#10,id#11,version#12,Date#13,value#14,address#15,col7_date#16,col8_float#17,col9_boolean#18,col10_string#19,col11_int#20,col12_date#21,col13_float#22,col14_boolean#23,col15_string#24,col16_int#25,col17_date#26,col18_float#27,col19_boolean#28,col20_string#29,col21_int#30,col22_date#31,col23_float#32,col24_boolean#33,col25_string#34,col26_int#35,col27_date#36,col28_float#37,col29_boolean#38,col30_string#39,col31_int#40,col32_date#41,col33_float#42,col34_boolean#43,col35_string#44,col36_int#45,col37_date#46,col38_float#47,col39_boolean#48,col40_string#49,col41_int#50,col42_date#51,col43_float#52,col44_boolean#53,col45_string#54,col46_int#55,col47_date#56,col48_float#57,col49_boolean#58,col50_string#59,col51_int#60,col52_date#61,col53_float#62,col54_boolean#63,col55_string#64,col56_int#65,col57_date#66,col58_float#67,col59_boolean#68,col60_string#69,col61_int#70,col62_date#71,col63_float#72,col64_boolean#73,col65_string#74,col66_int#75,col67_date#76,col68_float#77,col69_boolean#78,col70_string#79,col71_int#80,col72_date#81,col73_float#82,col74_boolean#83,col75_string#84,col76_int#85,col77_date#86,col78_float#87,col79_boolean#88,col80_string#89,col81_int#90,col82_date#91,col83_float#92,col84_boolean#93,col85_string#94,col86_int#95,col87_date#96,col88_float#97,col89_boolean#98,col90_string#99,col91_int#100,col92_date#101,col93_float#102,col94_boolean#103,col95_string#104,col96_int#105,col97_date#106,col98_float#107,col99_boolean#108,col100_string#109] csv == Optimized Logical Plan == Relation[key#10,id#11,version#12,Date#13,value#14,address#15,col7_date#16,col8_float#17,col9_boolean#18,col10_string#19,col11_int#20,col12_date#21,col13_float#22,col14_boolean#23,col15_string#24,col16_int#25,col17_date#26,col18_float#27,col19_boolean#28,col20_string#29,col21_int#30,col22_date#31,col23_float#32,col24_boolean#33,col25_string#34,col26_int#35,col27_date#36,col28_float#37,col29_boolean#38,col30_string#39,col31_int#40,col32_date#41,col33_float#42,col34_boolean#43,col35_string#44,col36_int#45,col37_date#46,col38_float#47,col39_boolean#48,col40_string#49,col41_int#50,col42_date#51,col43_float#52,col44_boolean#53,col45_string#54,col46_int#55,col47_date#56,col48_float#57,col49_boolean#58,col50_string#59,col51_int#60,col52_date#61,col53_float#62,col54_boolean#63,col55_string#64,col56_int#65,col57_date#66,col58_float#67,col59_boolean#68,col60_string#69,col61_int#70,col62_date#71,col63_float#72,col64_boolean#73,col65_string#74,col66_int#75,col67_date#76,col68_float#77,col69_boolean#78,col70_string#79,col71_int#80,col72_date#81,col73_float#82,col74_boolean#83,col75_string#84,col76_int#85,col77_date#86,col78_float#87,col79_boolean#88,col80_string#89,col81_int#90,col82_date#91,col83_float#92,col84_boolean#93,col85_string#94,col86_int#95,col87_date#96,col88_float#97,col89_boolean#98,col90_string#99,col91_int#100,col92_date#101,col93_float#102,col94_boolean#103,col95_string#104,col96_int#105,col97_date#106,col98_float#107,col99_boolean#108,col100_string#109] csv == Physical Plan == *(1) FileScan csv [key#10,id#11,version#12,Date#13,value#14,address#15,col7_date#16,col8_float#17,col9_boolean#18,col10_string#19,col11_int#20,col12_date#21,col13_float#22,col14_boolean#23,col15_string#24,col16_int#25,col17_date#26,col18_float#27,col19_boolean#28,col20_string#29,col21_int#30,col22_date#31,col23_float#32,col24_boolean#33,col25_string#34,col26_int#35,col27_date#36,col28_float#37,col29_boolean#38,col30_string#39,col31_int#40,col32_date#41,col33_float#42,col34_boolean#43,col35_string#44,col36_int#45,col37_date#46,col38_float#47,col39_boolean#48,col40_string#49,col41_int#50,col42_date#51,col43_float#52,col44_boolean#53,col45_string#54,col46_int#55,col47_date#56,col48_float#57,col49_boolean#58,col50_string#59,col51_int#60,col52_date#61,col53_float#62,col54_boolean#63,col55_string#64,col56_int#65,col57_date#66,col58_float#67,col59_boolean#68,col60_string#69,col61_int#70,col62_date#71,col63_float#72,col64_boolean#73,col65_string#74,col66_int#75,col67_date#76,col68_float#77,col69_boolean#78,col70_string#79,col71_int#80,col72_date#81,col73_float#82,col74_boolean#83,col75_string#84,col76_int#85,col77_date#86,col78_float#87,col79_boolean#88,col80_string#89,col81_int#90,col82_date#91,col83_float#92,col84_boolean#93,col85_string#94,col86_int#95,col87_date#96,col88_float#97,col89_boolean#98,col90_string#99,col91_int#100,col92_date#101,col93_float#102,col94_boolean#103,col95_string#104,col96_int#105,col97_date#106,col98_float#107,col99_boolean#108,col100_string#109] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/F:/20kColumns/var_cols/comma_records200000_narrow_columns100Xrows2000.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<key:string,id:string,version:string,Date:string,value:string,address:string,col7_date:stri... 00 Relation[key#10,id#11,version#12,Date#13,value#14,address#15,col7_date#16,col8_float#17,col9_boolean#18,col10_string#19,col11_int#20,col12_date#21,col13_float#22,col14_boolean#23,col15_string#24,col16_int#25,col17_date#26,col18_float#27,col19_boolean#28,col20_string#29,col21_int#30,col22_date#31,col23_float#32,col24_boolean#33,col25_string#34,col26_int#35,col27_date#36,col28_float#37,col29_boolean#38,col30_string#39,col31_int#40,col32_date#41,col33_float#42,col34_boolean#43,col35_string#44,col36_int#45,col37_date#46,col38_float#47,col39_boolean#48,col40_string#49,col41_int#50,col42_date#51,col43_float#52,col44_boolean#53,col45_string#54,col46_int#55,col47_date#56,col48_float#57,col49_boolean#58,col50_string#59,col51_int#60,col52_date#61,col53_float#62,col54_boolean#63,col55_string#64,col56_int#65,col57_date#66,col58_float#67,col59_boolean#68,col60_string#69,col61_int#70,col62_date#71,col63_float#72,col64_boolean#73,col65_string#74,col66_int#75,col67_date#76,col68_float#77,col69_boolean#78,col70_string#79,col71_int#80,col72_date#81,col73_float#82,col74_boolean#83,col75_string#84,col76_int#85,col77_date#86,col78_float#87,col79_boolean#88,col80_string#89,col81_int#90,col82_date#91,col83_float#92,col84_boolean#93,col85_string#94,col86_int#95,col87_date#96,col88_float#97,col89_boolean#98,col90_string#99,col91_int#100,col92_date#101,col93_float#102,col94_boolean#103,col95_string#104,col96_int#105,col97_date#106,col98_float#107,col99_boolean#108,col100_string#109] csv However, to our observations, most of the time is spent in the driver. The computation itself takes little time. Here is the timeline <nabble_img src="timeline.png" border="0"/> Java Visual VM sampler shows that most of the time is spent in Catalyst. About 94% of the time is spent in method catalyst.plans.logical.LogicalPlan.resolve() and 36% in catalyst.plans.logical.LogicalPlan$$anonfun$resolve$2.apply() SUMMARY: Based on the above discussion, it is very interesting why the time to instantiate a DataFrame has polynomial dependency on the number of columns (the same was also observed for basic transformations). Could this be performed in linear time, as in the second testing job with spark.createDataFrame(data,schema)? In both cases the schema (all columns are strings), content and output are exactly the same. Could this polynomial dependence be caused by Catalyst, bug in Spark core, code generation or JVM setup? Or are we missing something? -- Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ --------------------------------------------------------------------- To unsubscribe e-mail: dev-unsubscr...@spark.apache.org